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Executive  Summary 

During  the  course  of  this  project  we  made  significant  progress  in  four  focus  areas  toward 
advancing  the  state-of-the-art  in  learning  control  theory  of  robust  and  adaptive  non¬ 
equilibrium  control  of  highly  nonlinear,  higher-order,  reconfigurable  systems: 

1.  Extend  Approximate  Dynamic  Programming  (ADP)  techniques  to  control  of 
nonlinear,  multiple  time  scale,  non-affme  systems  in  an  Adaptive  Control 
framework. 

2.  Develop  solution  techniques  for  Markov  Decision  Problems  (MDP)  that  scale  to 
continuous  state  and  control  spaces  with  constraints. 

3.  Extend  MDP  techniques  to  solve  multi-agent  co-ordination  and  control  problems 
in  a  decentralized  fashion. 

4.  Develop  solution  techniques  that  scale  to  continuous  state-space  Partially 
Observable  Markov  Decision  Problems  (POMDP)  and  their  multi-agent 
generalizations. 

Progress  Reporting 

Our  work  is  presented  as  journal  and  conference  papers,  with  an  introduction  to  each 
paper  and  the  actual  paper  included  below.  The  work  conducted  by  John  Valasek  and  his 
students  is  presented  first,  followed  by  the  work  conducted  by  Suman  Chakravorty  and 
his  students. 
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TECHNICAL  SUMMARIES  -  JOHN  VALASEK 


2011: 

1.  Siddarth,  Anshu  and  John  Valasek.  2011.  Kinetic  State  Tracking  for  a  Class  of 

Singularly  Perturbed  Systems.  Journal  of  Guidance,  Control,  and  Dynamics, 
Vol.  34,  No. 3,  pp.  734-749. 

The  benefit  of  singular  perturbation  theory,  specifically  the  Tikhonov  theorem  in 
flight  control  is  at  the  level  of  modeling  where  it  is  used  as  a  model  reduction 
technique.  This  paper  considers  a  class  of  nonlinear  flight  control  problems  that 
do  not  satisfy  the  underlying  conditions  of  the  Tikhonov  theorem.  Using  insights 
from  geometric  singular  perturbation  theory  and  concept  of  center  manifolds,  this 
paper  formulates  a  modified  composite  control  law  scheme  that  retains  the 
benefits  of  the  Tickhonov  theorem  for  this  general  class  of  nonlinear  problems. 
The  main  contribution  of  the  developed  result  is  that  it  is  independent  of  the  time 
scale  separation  between  the  translational  and  rotational  dynamics. 

2.  Siddarth,  Anshu  and  Valasek,  John.  2011.  Global  Tracking  Control  Structures 
for  Nonlinear  Singularly  Perturbed  Aircraft  Systems,  in  Advances  in 
Aerospace  Guidance,  Navigation  &  Control,  Florian  Holzapfel  and  Stephan  Theil, 
Eds,  Springer  Berlin  Heidelberg,  DOI:  10.1007/978-3-642-19817-5  19,  pp  235- 
246. 

This  paper  is  concerned  with  guaranteeing  asymptotic  tracking  of  both  slow  and 
fast  dynamics  of  a  nonlinear  system  -  applications  of  which  provides  departure 
resistance  capability  for  unmanned  aerial  vehicles.  The  synthesized  controller 
takes  advantage  of  singular  perturbation  theory  and  feedback  linearization  results 
to  establish  global  exponential  stability.  The  performance  is  demonstrated  for  a 
nonlinear,  coupled,  six  degree-of- freedom  F/A-18. 

3.  Siddarth,  Anshu  and  Valasek,  John.  2011.  Output  Tracking  of  Non-Minimum 
Phase  Systems.  AIAA-20 11-6487,  Proceedings  of  the  2011  AIAA  Guidance, 
Navigation,  and  Control  Conference,  Portland,  Oregon,  9  August  2011. 

The  main  challenge  in  synthesizing  a  controller  for  a  tail-controlled  vehicle  is  to 
achieve  maximum  performance  it  is  capable  of,  without  exciting  the  unstable 
internal  dynamics.  This  paper  revisits  the  problem  and  synthesizes  a  globally 
stable  full-state  feedback  controller  for  desired  output  tracking.  No  modification 
to  the  output/or  placement  of  a  virtual  inertial  measurement  unit  is  placed.  The 
underlying  approach  is  to  induce  a  time  scale  separation  between  the  output 
dynamics  and  the  unstable  internal  dynamics.  The  main  contributions  of  the 
synthesis  are  analytic  conditions  under  which  the  time  scale  separation  can  be 
induced,  and  stability  and  robustness  of  the  controller  guaranteed. 
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2012: 

4.  Narang-Siddarth,  Anshu  and  Valasek,  John.  2012.  Tracking  Control  Design  for 
Non-Standard  Nonlinear  Singularly  Perturbed  Systems.  WeA06.6, 
Proceedings  of  the  2012  American  Control  Conference,  Montreal,  Canada,  27 
June  2012. 

In  this  paper  novel  state-feedback  control  laws  are  developed  for  a  general  class 
of  two  time  scale  continuous  time  systems  which  are  nonlinear  in  both  slow  and 
fast  states.  No  assumption  concerning  the  type  of  non-linearity  of  the  system  is 
made  and  the  technique  is  applicable  to  both  standard  and  non-standard  forms  of 
singularly  perturbed  systems.  Non-standard  forms  of  singularly  perturbed  systems 
violate  fundamental  conditions  for  Tikchonov  theorem  -  hindering  use  of  model 
reduction  for  control  synthesis.  Asymptotic  stabilization  of  the  proposed  ‘indirect 
manifold  construction  approach’  is  proven  using  Lyapunov  methods.  Results 
show  that  the  proposed  technique  applies  both  to  standard  and  non-standard  forms 
and  guarantees  asymptotic  stabilization  with  quantifiable  robustness  bound  for  the 
singular  perturbation  parameter. 

5.  Valasek,  John,  Kirkpatrick,  Kenton,  and  May,  James,  "Intelligent  Motion  Video 
Guidance  for  Unmanned  Air  System  Ground  Target  Surveillance,"  AIAA 
2012-2587,  Proceedings  of  the  2012  AIAA  Infotech@Aerospace  Conference, 
Garden  Grove,  CA  21  June  2012. 

This  paper  develops  an  algorithm  for  surveillance  of  ground  targets  by  UAS  with 
fixed  pan  and  tilt  cameras,  in  the  presence  of  winds.  The  specific  Reinforcement 
Learning  algorithm  used  is  Q-leaming,  and  the  objective  of  the  approach  is  to 
bring  any  target  located  in  an  image  captured  by  a  camera  into  the  center  of  the 
image  using  the  learned  control  policy.  The  learning  agent  determines  offline 
(initially)  how  to  control  the  UAS  and  camera  to  get  a  target  from  any  point  in  the 
image  to  the  center  and  hold  it  there.  A  feature  of  this  approach  is  that  the 
learning  agent  will  continue  to  learn  and  refine  and  update  the  previously  offline 
learned  control  policy,  during  actual  operation. 

6.  Dunn,  Caroline,  Kirkpatrick,  Kenton,  and  Valasek,  John,  "Unmanned  Air 
System  Search  and  Localization  Guidance  Using  Reinforcement  Learning," 

AIAA  2012-2589,  Proceedings  of  the  2012  AIAA  Infotech@Aerospace 
Conference,  Garden  Grove,  CA  21  June  2012. 

This  paper  investigates  aircraft  flight  path  guidance  for  search  and  localization  of 
Regions  of  Interest,  consisting  of  atmospheric  phenomena.  The  problem  is  posed 
as  an  offline  agent  learning  problem,  of  localizing  atmospheric  thermal  locations 
and  then  guiding  an  Unmanned  Air  Vehicle  to  soar  from  one  to  another.  Q- 
leaming  is  used  as  the  learning  algorithm.  The  computational  navigation  solution 
used  here  is  a  basic  grid  algorithm  that  assigns  thermal  locations  and  intensities, 
with  the  representation  being  specified  states,  actions,  goals,  and  rewards  that  are 
used  to  accomplish  the  agent  learning. 
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7.  Narang-Siddarth,  Anshu  and  Valasek,  John.  2012.  Tracking  Control  for  a  Non- 
Minimum  Phase  Autonomous  Helicopter.  AIAA  2012-4453,  Proceedings  of  the 
2012  AIAA  Guidance,  Navigation  &  Control  Conference.  Minneapolis,  MN,  13 
August  2012. 

This  paper  presents  an  application  of  authors’  prior  work  on  control  of  singularly 
perturbed  systems  (2012,  American  Control  Conference)  to  an  autonomous 
helicopter  to  mitigate  non-minimum  phase  behavior.  Fundamentally,  this  behavior 
is  the  technical  term  for  the  helicopter’s  inherent  nature  to  induce  instability  when 
commanded  to  hover  over  a  specific  point  in  space.  This  research  identifies  that 
non-minimum  phase  behavior  is  actually  related  to  non-standard  singularly 
perturbed  systems,  and  stable  control  is  only  possible  if  the  helicopter  exhibits 
standard  singularly  perturbed  behavior.  This  key  inference  allows  implementation 
of  the  ‘indirect  manifold  construction  approach’  control  scheme,  which 
commands  desired  control  action  to  perform  the  required  transformation.  The 
novel  aspect  of  this  implementation  is  that  it  provides  a  mechanism  for  feedback 
gain  selection  in  nonlinear  settings. 

8.  Valasek,  John,  Akella,  Marathi  R.,  Siddarth,  Anshu,  and  Rollins,  Elizabeth.  2012. 
Adaptive  Dynamic  Inversion  Control  of  Linear  Plants  with  Control  Position 
Constraints.  IEEE  Transactions  on  Control  Systems  Technology  ,  Vol.20,  No.4, 
pp  918-933. 

This  paper  is  concerned  with  designing  control  actions  for  a  vehicle  to  follow  a 
prescribed  guidance  solution  under  design  constraints.  To  fully  take  advantage  of 
the  design  constraints,  the  proposed  switching  control  action  gives  priority  to 
safety  over  guidance  when  the  vehicle  is  close  to  instability.  This  switching 
action,  however,  has  been  known  to  induce  instability  when  implemented 
incorrectly.  To  avoid  this  the  paper  introduces  two  novel  concepts:  ‘domain  of 
control  authority’  and  ‘direction  consistent  control  mechanism’  that  together 
identify  boundary  of  control  actions  that  always  guarantee  safety  and  required 
tracking  and  always  restrict  the  commanded  control  action  within  this  boundary. 
Controller  performance  is  demonstrated  with  numerical  examples  of  a  two 
degree-of- freedom  dynamic  model  and  an  F-16XL  aircraft  model. 

2013: 

9.  Kirkpatrick,  Kenton,  and  Valasek,  John,  "Approximation  of  Agent  Dynamics 
Using  Reinforcement  Learning,"  AIAA  2013-0875,  Proceedings  of  the  51st 
AIAA  Aerospace  Sciences  Meeting  Including  the  New  Horizons  Forum  and 
Aerospace  Exposition,  Grapevine,  TX,  9  January  2013. 

In  this  paper,  Reinforcement  Learning-based  algorithms  are  developed  for 
learning  agents'  time  dependent  dynamics  while  also  learning  to  control  them. 
Three  algorithms  are  introduced.  Sampled-Data  Q-leaming  is  an  algorithm  that 
learns  the  optimal  sample  time  for  controlling  an  agent  without  a  prior  model. 
First-Order  Dynamics  Learning  is  an  algorithm  that  determines  the  proper  time 
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constants  for  agents  known  to  have  first-order  dynamics,  while  Second-Order 
Dynamics  Learning  is  an  algorithm  for  learning  natural  frequencies  and  damping 
ratios  of  second-order  systems. 

10.  Narang-Siddarth,  Anshu  and  Valasek,  John.  2013.  A  Constructive  Stabilization 
Approach  for  Open-Loop  Unstable  Non-Affine  Systems.  Proceedings  of  the 
2013  American  Control  Conference,  Washington  D.C,  19  June  2013. 

This  paper  pursues  to  find  a  constructive  feedback  control  strategy  to  address  the 
inherent  nonlinearities  and  complexities  of  underlying  physical  systems  that  form 
the  basis  of  today’s  heterogeneous  applications  (for  example:  flapping-wing 
systems).  Dynamics  of  such  systems  are  nonlinear  in  the  control  input,  and  are 
more  commonly  referred  as  control-nonaffine  systems  in  the  literature.  This 
special  structure  however  violates  collinearity;  a  key  property  essential  to  employ 
nonlinear  control  techniques  such  as  feedback  linearization,  backstepping  and 
Lyapunov-based  redesign  methods.  The  main  contribution  of  this  article  is  that  it 
formulates  a  generalization  of  the  famous  Kalman- Yakubovich-Popov  lemma  for 
non-affine  systems  under  mild  restrictions.  This  new  result  helps  to  determine 
whether  or  not  an  input-output  description  of  a  nonlinear  system  is  passive.  It  is 
expected  that  this  generalization  will  play  a  vital  role  in  developing  adaptive 
control  laws  for  nonlinear  systems  based  on  Lyapunov’s  direct  method  analogous 
to  its  linear  counterpart. 

1 1 .  Narang-Siddarth,  Anshu  and  Valasek,  John.  2013.  Necessary  Conditions  for 
Feedback  Passivation  of  Nonaffine-in-Control  Systems.  Proceedings  of  the 
2013  SIAM  Conference  on  Control  &  Its  Applications. 

This  research  explores  a  universal  stabilization  formula  for  an  unstable  non-affine 
system  by  developing  a  novel  composite  control  law  structure.  The  intuitive  idea 
behind  this  control  form  is  to  introduce  stiffness  and  damping  into  the 
system.  The  major  contribution  comes  in  identifying  principles  that  converts  an 
open-loop  unstable  system  into  stable  in  the  Lyapunov  sense  closed-loop  system 
through  static-feedback  alone.  Most  importantly,  conditions  under  which  a 
nonlinear  system  be  rendered  passive  through  static  state-feedback  without 
making  any  assumptions  about  the  nature  of  the  control  influence  are  developed 
in  this  paper. 

12.  Rollins,  Elizabeth,  Valasek,  John,  Muse,  Jonathan,  and  Bolender,  Michael, 
"Nonlinear  Adaptive  Dynamic  Inversion  Applied  to  a  Generic  Hypersonic 
Vehicle,"  AIAA  2012-5234,  Proceedings  of  the  2013  AIAA  Guidance, 
Navigation,  and  Control  Conference,  Boston,  MA,  22  August  2013. 

Previous  work  on  control  design  for  hypersonic  vehicles  often  uses  linearized  or 
simplified  nonlinear  dynamical  models  of  the  vehicle,  and  very  little  work  has 
been  done  on  recovering  from  unstart  events.  Using  a  generic  hypersonic  vehicle 
as  a  control  design  and  simulation  model,  this  paper  develops  a  nonlinear  adaptive 
dynamic  inversion  control  architecture  with  a  control  allocation  scheme  to  track 
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realistic  flight  path  angle  trajectories.  A  robustness  analysis  is  performed  on  the 
initial  control  architecture  design,  which  shows  that  the  control  architecture  is 
able  to  handle  time  delays,  perturbations  in  stability  derivatives,  and  reduced 
control  surface  effectiveness.  The  control  architecture  then  is  evaluated  for  its 
ability  to  handle  inlet  unstart. 

13.  Henrickson,  James,  Kirkpatrick,  Kenton,  and  Valasek,  John,  "Rapid 
Characterization  of  Shape  Memory  Alloy  Material  Parameters  Using 
Computational  Intelligence  Methods,"  SMASIS2013-3016,  Proceedings  of  the 
15th  ASME  2013  Conference  on  Smart  Materials,  Adaptive  Structures  and 
Intelligent  Systems,  Snowbird,  UT,  16  September  2013. 

Shape  memory  alloys  are  capable  of  delivering  advantageous  solutions  to  a  wide 
range  of  engineering-based  problems.  Implementation  of  these  solutions, 
however,  is  often  complicated  by  the  hysteretic,  non-linear,  thermo-mechanical 
behavior  of  the  material.  Although  existing  shape  memory  alloy  constitutive 
models  are  largely  accurate  in  describing  this  unique  behavior,  they  require  prior 
characterization  of  the  material  parameters.  Consequently,  before  thorough 
modeling  and  simulation  can  occur  for  a  shape  memory  alloy-based  project,  one 
must  first  go  through  the  process  of  identifying  several  material  parameters 
unique  to  shape  memory  alloys.  Current  characterization  procedures  necessitate 
extensive  experimentation,  data  collection,  and  data  processing.  As  a  result,  these 
methods  simultaneously  create  a  high  barrier  of  entry  for  engineers  new  to  active 
materials  and  impede  the  advanced  study  of  shape  memory  alloy  material 
parameter  evolution.  This  paper  develops  a  novel  method  in  which  computational 
intelligence  methods  are  used  to  rapidly  identify  shape  memory  alloy  material 
parameters.  Specifically,  an  artificial  neural  network  is  trained  to  identify 
transformation  temperatures  and  stress  influence  coefficients  of  given  shape 
memory  alloy  specimens  using  strain-temperature  coordinates  as  inputs.  After 
generating  training  data  through  the  use  of  a  constitutive  model,  the  resulting 
trained  artificial  neural  network  was  used  to  identify  parameters  for  a  number  of 
randomly  generated  theoretical  shape  memory  alloys.  Results  show  that  the 
artificial  neural  network  was  able  to  rapidly  identify  both  transformation 
temperatures  and  stress  influence  coefficients  with  satisfactory  accuracy.  The 
generation  of  training  data  was  then  repeated  using  Taguchi  methods.  Further 
results  show  that  the  artificial  neural  network  trained  with  the  Taguchi-based 
training  data  yielded  improved  characterization  accuracy  while  using  less  training 
data. 


2014: 

14.  Woodbury,  Timothy,  Dunn,  Caroline,  and  Valasek,  John,  "Autonomous  Soaring 
Using  Reinforcement  Learning  for  Trajectory  Generation,"  AIAA-2014- 
0990,  Proceedings  of  the  AIAA  Science  and  Technology  Forum  and  Exposition 
2014:  52nd  Aerospace  Sciences  Meeting,  National  Harbor,  MD,  15  January  2014. 
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This  paper  develops  an  approach  for  planar  lateral/directional  guidance  of  a  linear 
dynamic  gliding  aircraft  to  a  known  thermal  location.  Reinforcement  learning  is 
utilized  to  generate  reference  bank  angle  commands  for  directing  the  aircraft  to 
close  proximity  of  the  updraft,  and  from  there  the  aircraft  follows  a  circling 
trajectory  centered  on  the  thermal  to  gain  energy.  A  Lyapunov-based  feedback 
control  law  is  used  to  generate  bank  angle  commands  when  circling  the  thermal. 
By  using  reinforcement  learning  the  problem  of  online  trajectory  generation  is 
reduced  to  a  simple  search  in  a  static  state-action  value  table.  This  approach  has 
the  advantage  of  low  computational  burden/overhead  in  practice.  Furthermore,  the 
need  for  a  precise  aircraft  model  for  learning  and  simulation  is  reduced.  Monte 
Carlo  results  presented  in  the  paper  demonstrate  that  the  reinforcement  learning 
guidance  agent  can  consistently  navigate  the  aircraft  to  the  thermal.  Reliable 
navigation  is  achieved  after  a  relatively  small  number  of  learning  episodes.  An 
analysis  of  typical  energy  gains  circling  a  thermal  of  constant  shape  and  size  is 
also  presented. 

15.  Siddarth,  Anshu,  Peter,  Florian,  Holzapfel,  Florian,  and  Valasek,  John, 
"Autopilot  for  a  Nonlinear  Non-Minimum  Phase  Tail-Controlled  Missile 
Using  Time  Scale  Separation,"  AIAA-2014-1293,  Proceedings  of  the  AIAA 
Science  and  Technology  Forum  and  Exposition  2014:  52nd  Aerospace  Sciences 
Meeting,  National  Harbor,  MD,  16  January  2014. 

Acceleration  control  of  highly  agile,  aerodynamically-controlled  missiles  is  a 
well-known  non-minimum  phase  control  problem.  This  problem  is  revisited  here 
for  a  planar  tail-controlled  generic  missile,  and  a  globally  stable  nonlinear 
autopilot  command  structure  is  synthesized  to  maximize  performance.  For  the 
first  time  the  non-minimum  phase  characteristics  of  the  vehicle  are  addressed  by 
making  no  modification  to  the  output  definition  by  inducing  an  inherent  time 
scale  separation  in  the  closed-loop  dynamics.  Unlike  previous  time  scale  control 
techniques,  results  presented  here  are  based  on  theoretical  advancements  made  in 
control  of  nonlinear  singularly  perturbed  systems.  Conditions  under  which  the 
induced  time  scale  separation  can  be  employed  for  a  stable  autopilot  design  are 
also  discussed.  The  state  feedback  controller  proposed  is  real-time  implementable, 
independent  of  operating  condition  and  desired  output  trajectory.  Simulation 
results  show  that  the  approach  is  able  to  accomplish  perfect  tracking  while 
keeping  all  closed-loop  signals  bounded. 
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The  trajectory-following  control  problem  for  a  general  class  of  nonlinear  multi-input/multi-output  two  time-scale 
system  is  revisited.  While  most  earlier  works  used  singular  perturbation  theory  and  assumed  that  an  isolated  real 
root  exists  for  the  nonlinear  set  of  algebraic  equations  that  constitute  the  slow  subsystem,  here,  two  time-scale  systems 
are  analyzed  in  the  context  of  integral  manifolds.  It  is  shown  that  the  singularly  perturbed  system  has  a  center 
manifold  and,  for  small  values  of  the  slow  state,  an  approximate  solution  of  the  nonlinear  set  of  transcendental 
equations  can  be  computed.  Geometric  singular  perturbation  theory  is  used  as  the  model-reduction  technique,  and 
modified  composite  control  design  is  used  to  formulate  the  stabilizing  control  laws  for  slow  state  tracking.  The  control 
laws  are  independent  of  the  scalar  perturbation  parameter  and  an  upper  bound  for  it,  and  the  closed-loop  error 
signals  are  determined  such  that  uniform  boundedness  of  the  closed-loop  system  is  guaranteed.  Additionally, 
asymptotic  stabilization  is  shown  for  the  nonlinear  regulation  problem.  The  methodology  is  demonstrated  through 
numerical  simulation  of  a  nonlinear  generic  two-degree-of-freedom  kinetic  model  and  a  nonlinear,  coupled,  six- 
degree-of-freedom  model  of  the  F/A-18A  Hornet.  Results  demonstrate  that  the  methodology  permits  close  tracking  of 
a  reference  trajectory  while  maintaining  all  control  signals  within  specified  bounds. 


Nomenclature 


A,Af 

-  positive  gain  matrices 

b 

=  wingspan,  ft 

c 

=  mean  aerodynamic  chord,  ft 

cD 

=  drag  coefficient 

CL 

=  lift  coefficient 

CY 

=  side  force  coefficient 

Ch  Cm,  Cn 

=  roll,  pitch,  and  yaw  moment  coefficients 

D 

=  domain  of  subscripted  variable 

8 

=  gravity  acceleration,  ft/  s2 

CJyJz 

=  principal  axis  inertias  for  aircraft,  slug  ft2 

M 

=  Mach  number 

m 

=  mass,  slug 

m 

=  number  of  slow  variables 

(M0) 

=  nonlinear  map 

Me 

=  invariant  manifold  of  full-order  system 

M0 

=  invariant  manifold  of  reduced-order  subsystems 

n 

=  number  of  fast  variables 

o() 

=  order  symbol 
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Subscripts 

b 

r 


Superscripts 

S 

U 


number  of  control  variables 

body  roll,  pitch,  and  yaw  rates,  deg  / s 

degree  of  smoothness 

reference  area,  ft2 

slow  time  scale 

maximum  thrust,  lb 

initial  time 

some  finite  time;  greater  than  t0 
control  vector 

Lyapunov  function  for  closed-loop  reduced  slow 

subsystem 

speed  of  sound,  ft/s 

vector  [x,  e]T 

Lyapunov  function  for  closed-loop  reduced  fast 
subsystem 

state  variables  of  full-order  system 
tracking  error 

error  between  fast  variable  and  exact  manifold 
angle  of  attack,  deg 
sideslip  angle,  deg 
perturbation  quantities 

elevator,  aileron,  and  rudder  control  inputs,  deg 

scalar  perturbation  parameter 

upper  bound  for  scalar  perturbation  parameter 

(for  stabilization  problem) 

throttle  input 

pitch  attitude  angle,  deg 

wind-axes  orientation  angles,  deg 

Lyapunov  function  for  complete  system 

density  of  air,  slug/ft3 

fast  time  scale 

approximation  of  exact  manifold 
roll  attitude  angle,  deg 
approximation  to  exact  manifold 
heading  angle,  deg 

derivative  with  respect  to  slow  time  scale 
derivative  with  respect  to  fast  time  scale 
Euclidean  norm 


bound  on  variable 
reference 


stable 

unstable 


I.  Introduction 

MATHEMATICAL  modeling  of  many  physical  systems 
requires  high-order  dynamic  equations.  The  presence  of 
parameters  such  as  spring  constant,  mass,  and  moments  of  inertia  are 
the  cause  of  stiffness  and  increased  order  of  these  equations.  It  is 
difficult  to  arrive  at  exact  analytical  solutions  of  these  nonlinear 
governing  equations  with  known,  and  sometimes  unknown,  variable 
coefficients,  so  an  approximate  solution  is  often  computed.  Singular 
perturbation  theory  is  a  scheme  used  to  simplify  systems  that 
inherently  possess  both  fast  and  slow  dynamics.  Such  systems  are 
characterized  by  a  small  parameter  e  multiplying  the  highest 
derivative.  Suppression  of  this  small  parameter  reduces  the  order  of 
the  system,  and  thus  the  label  of  singularly  perturbed.  Singular 
perturbation  theory  dates  back  to  the  1904  work  of  Prandtl  [1]  on 
fluid  boundary  layers;  subsequently,  applications  of  perturbation 
methods  were  explored  for  control  design  [2-4]. 

The  main  contribution  of  perturbation  methods  is  at  the  level  of 
modeling,  where  it  has  been  used  as  a  model-reduction  technique  as 
well  as  a  means  of  removing  the  numerical  stiffness  in  the  original 
system.  In  particular,  the  method  of  matched  asymptotic  expansions 
reduces  the  study  of  the  full-order  system  of  equations  to  the  study  of 


two  other  degenerate  models.  The  first  model  captures  the  dominant 
phenomena,  and  the  neglected  phenomena  is  handled  in  the  second. 
For  the  full-order  system  of  the  form 

x  =  f(x,z,e)  6Z  =  /(X,Z,6)  (1) 

the  lower-order  models  are  developed  to  be  the  following: 

Reduced  slow  subsystem, 

x  =  f(x,z,e)  0  =  /(x,z,e)  (2) 

Reduced  fast  subsystem, 

x  =  0  z'  =  l(x,z,e)  (3) 

where  e  represents  the  scalar  perturbation  parameter,  and '  represents 
the  derivative  with  respect  to  the  fast  time  scale  r  =  (t  —  t0)/e.  It  has 
been  shown  that  the  behavior  of  the  complete  system  of  Eqs.  (1_)  is 
constrained  within  the  0(e)  bound  of  the  reduced  slow  subsystem, 
provided  the  dynamics  of  the  reduced  fast  subsystem  are  stabilizing 
[5].  One  problem  evident  with  the  reduced  slow  subsystem  is  the 
solution  of  the  transcendental  or  algebraic  set  of  equations  for  the  fast 
states  z.  It  is  known  that  there  may  be  many  solutions  satisfying  this 
set  of  equations.  The  standard  singular  perturbation  model  assumes 
that,  in  the  domain  of  interest,  these  solutions  be  isolated  real  roots. 

Tracking  properties  of  standard  singularly  perturbed  systems  were 
first  studied  by  Grujic  [6]  in  1982.  This  work  laid  the  foundations  of 
tracking  theory  in  a  Lyapunov  sense.  Later,  in  1988,  this  work  was 
extended  for  nonlinear  time-varying  singularly  perturbed  systems 
[7].  However,  it  is  assumed  that  separate  controls  are  available  for 
both  the  reduced  slow  and  the  reduced  fast  subsystems,  and  the 
algebraic  set  of  equations  have  a  trivial  solution.  Christofides  et  al.  [8] 
developed  robust  controller  designs  for  systems  with  a  stabilizable 
fast  subsystem,  and  input/output  linearizable  slow  subsystems  with 
input-to-state  stable  inverse  dynamics.  This  work  considered  a 
general  class  of  nonlinear  time- varying  singularly  perturbed  systems 
that  have  dynamics  linear  in  the  fast  states.  Another  approach  to 
tracking  was  presented  by  Heck  [9]  in  1991 .  He  addressed  the  design 
of  sliding-mode  controllers  for  a  class  of  linear  time-invariant 
systems  where  tracking  of  slow  variables  is  desired.  For  both  reduced 
subsystems,  a  sliding-mode  controller  is  designed,  and  a  composite 
of  these  controls  is  then  implemented  on  the  full-order  system.  The 
concept  of  composite  control,  or  designing  separate  controllers,  for 
each  of  the  subsystems  and  then  implementing  their  cumulative  to  the 
full  higher-order  system  was  initiated  by  Suzuki  and  Miura  [10],  and 
since  then,  this  concept  has  been  extensively  used  by  researchers  for 
robust  stabilization  of  systems  with  time-scale  properties  [11-13]. 

In  the  aircraft  literature,  the  rotational  equations  of  motion 
constitute  the  fast  subsystem.  These  equations  are  highly  coupled 
and  nonlinear;  thus,  there  exist  multiple  solutions  for  the  set  of 
nonlinear  algebraic  equations.  Tracking  of  slow  variables  for  these 
systems  is  achieved  by  making  two  key  assumptions.  First,  the 
control  surface  deflections  do  not  affect  the  slow  states.  Second,  the 
fast  variables  are  the  actuators  for  the  slow  subsystem.  Pioneering 
work  in  this  area  was  published  by  Menon  et  al.  [14]  in  1987. 
Reference  [14]  designed  a  flight-test  trajectory  control  system  using 
dynamic  inversion.  The  output  variables  to  be  tracked  were  total 
velocity,  angle  of  attack,  sideslip,  and  altitude.  Once  the  desired 
angular  rates  were  calculated,  the  dynamic  inversion  was  applied  to 
the  fast  subsystem  to  compute  the  aerodynamic  control  surface 
deflections.  This  work  was  extended  to  overactuated  systems  by 
Snell  et  al.  [15].  More  recently,  the  same  concept  has  been  employed 
to  design  longitudinal  windshear  flight-control  laws  [16]  and  for 
control  of  generic  reentry  vehicles  [17]. 

Although  all  of  the  systems  studied  fall  under  the  category  of 
Eqs.  (1_),  different  design  methodologies  have  been  developed  for 
varied  physical  systems,  and  several  different  control  techniques 
have  been  employed.  The  control  laws  developed  for  a  general  form 
of  physical  systems  assume  the  existence  of  a  unique  solution  of  the 
transcendental  equation.  For  general  dynamical  system  models,  the 
existence  of  isolated  roots  for  the  fast  states  is  not  guaranteed. 
Although  aircraft  literature  addresses  this  problem  by  employing 
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assumptions  about  the  plant  model,  there  is  no  general  methodology 
in  the  literature  to  date  to  design  tracking  control  structures  for 
singularly  perturbed  systems  that  are  nonlinear,  both  in  the  slow  and 
the  fast  states.  The  open-loop  study  of  these  systems  has  been  the 
focus  of  the  geometric  singular  perturbation  theory  [18].  This  theory 
has  been  employed  in  the  past  for  transforming  dynamical  systems 
into  singular  perturbation  form  [Eq.  (_!)]  [19,20]  and  to  develop 
reduced-order  models  [21].  Work  by  Sharkey  and  O’Reilly  [22]  used 
this  approach  to  design  stabilizing  control  laws  for  a  special  of  class 
of  singularly  perturbed  systems  wherein  the  control  appears  only  in 
the  fast  dynamics.  The  global  nature  of  the  preceding  stabilization 
results  was  proved  by  Chen  [23]  later  on  in  1998. 

In  this  paper,  the  use  of  geometric  methods  is  extended  to  a  general 
class  of  time- varying  singularly  perturbed  systems  that  are  nonlinear 
in  both  the  slow  and  the  fast  states.  The  problem  of  control  for  this 
general  class  of  singularly  perturbed  systems  is  addressed  for  the  first 
time  in  a  systematic  manner.  The  paper  makes  two  major  contri¬ 
butions.  First,  this  work  is  not  restricted  to  systems  that  have  a  unique 
solution  for  the  nonlinear  algebraic  set  of  equations  of  the  slow 
subsystem.  The  presence  of  multiple  roots  is  accounted  for  by 
proving  that  a  center  manifold  exists  for  the  slow  subsystem.  This 
allows  for  the  incorporation  of  results  from  the  center  manifold  theory 
that  are  helpful  in  obtaining  approximate  roots  of  the  transcendental 
equations.  Tracking  control  laws  are  designed  for  both  the  slow  and 
the  fast  subsystems  to  track  the  desired  reference  and  computed 
approximation,  respectively,  using  a  composite  control  methodology. 
Second,  the  composite  control  law  is  not  a  function  of  the  scalar 
perturbation  parameter,  nor  does  it  require  knowledge  of  it.  This  is  an 
important  consideration  for  systems  such  as  aircraft,  where  quan¬ 
tifying  this  parameter  can  be  difficult.  The  proposed  control  scheme  is 
able  to  guarantee  asymptotic  stabilization  of  states  for  a  general  class 
of  nonlinear  regulation  problems  and  uniform  bounded  stability  for 
the  trajectory -folio wing  problem.  Using  Lyapunov  theory,  a 
conservative  upper  bound  e*  is  derived  for  the  singular  perturbation 
parameter  for  which  these  results  hold.  From  the  stability  analysis,  it 
is  shown  that  this  approach  applies  to  all  classes  of  singularly 
perturbed  systems,  with  tracking  properties  of  standard  singular 
perturbation  models  being  a  special  case.  The  approach  and  meth¬ 
odology  is  demonstrated  with  simulation  examples  for  a  nonlinear 
generic  two-degree-of-freedom  kinetic  model  and  a  nonlinear, 
coupled  six-degree-of-freedom  F/A-18A  Hornet  aircraft. 

The  paper  is  organized  as  follows.  Section  H  describes  the  class  of 
systems  considered  and  formulates  the  control  problem.  Section  III 
presents  the  necessary  concepts  of  geometric  singular  perturbation 
theory  and  motivates  this  work.  Section  IV  makes  an  important 
observation  about  the  existence  of  a  center  manifold  for  the 
singularly  perturbed  system  and  details  the  procedure  to  compute  this 
manifold.  Section  V  develops  the  reduced-order  models  and  for¬ 
mulates  the  tracking  control  laws.  The  proof  of  stability  and  main 
results  are  also  presented  in  this  section.  Numerical  simulations  are 
presented  in  Sec.  VI,  and  conclusions  are  discussed  in  Sec.  VII. 

II.  Problem  Formulation 

The  dynamic  system  considered  is  the  nonlinear  affine  in  the 
control  singularly  perturbed  system,  mathematically  expressed  as 

x  =  f(x,  z)  +  g(x,  z)u  (4) 

ez  =  /(x,  z)  +  k(x,  z)u  (5) 

where  x  e  Rm  is  the  set  of  slow  variables  of  the  system,  z  e  M"  is  the 
vector  of  the  fast  variables,  and  u  e  Rp  is  the  set  of  the  control 
variable.  The  singular  perturbation  parameter  satisfies  0  <  e  1 
and  e  e  R+.  The  vector  fields  f(.),  g(.),  /(.),  and  k(.)  are  such  that 
the  closed-loop  system  is  twice  continuously  differentiable  with 
respect  to  their  arguments.  The  control  objective  is  to  control  the  slow 
state  to  asymptotically  track  a  specified  twice  continuously 
differentiable  time-varying  bounded  trajectory,  or  x(t)  —>  xr(t)  as 
t  ->  oo. 


Remark  1 :  The  functions  g  (x ,  z)  and  k  (x ,  z)  represent  the  control- 
influence  terms,  while  all  other  terms  such  as  inertial  coupling  and 
gravitational  forces  are  all  contained  in  f(x,  z)  and  /(x,  z). 

Remark  2:  For  a  rigid  body,  x  are  the  translational  velocities  while 
z  represents  the  angular  velocities.  The  rotational  dynamics  for  a 
rigid  body  contain  the  nonlinear  inertial  coupling  terms.  The  function 
/(x,  z)  captures  this  nonlinearity  in  the  fast  states. 

III.  Background:  Geometric  Singular 
Perturbation  Theory 

Singular  perturbation  theory  is  a  tool  used  to  obtain  the  reduced- 
order  approximations  of  the  full-order  equations  of  motion,  which 
are  difficult  to  analyze.  The  theory  is  valid  so  long  as  the  parameter  e 
remains  sufficiently  small  and  the  time-scale  behavior  is  preserved. 
The  method  of  matched  asymptotic  expansions  [24]  and  its  variation, 
the  method  of  composite  expansions  [24],  have  been  the  foremost 
methods  employed  to  develop  these  reduced-order  models.  The 
alternative  geometric  approach  describes  the  motion  of  the  full-order 
system  using  the  concept  of  invariant  manifolds.  Both  approaches 
produce  the  exact  same  reduced-order  models  but  with  different 
assumptions  about  the  system.  Asymptotic  methods  assume  that  the 
dynamical  system  possesses  isolated  roots,  while  the  geometric 
approach  is  more  general  and  takes  into  consideration  multiple 
nonisolated  roots  of  nonlinear  systems. 

To  introduce  the  necessary  concepts  of  geometric  singular 
perturbation  theory  for  an  open-loop  dynamical  system,  consider  the 
nonlinear  autonomous  system: 


X  =  f(x,  z) 

(6) 

€Z  =  /(X,  Z) 

(7) 

Note  that  the  following  results  also  apply  to  nonautonomous 
systems.  Equations  (6)  and  (7)  can  be  rewritten  in  the  fast  time  scale 
r  =  (t  —  t0)/e  as 


x '  =  ef(x,  z) 

(8) 

z'  —  l(x,  z) 

(9) 

The  independent  variables  t  and  r  are  referred  to  as  the  slow  and 
the  fast  time  scales,  respectively,  and  Eqs.  (6—9)  (referred  to  as  the 
slow  and  the  fast  systems,  respectively)  are  equivalent  whenever 
e  ^  0.  First,  the  system  is  studied  for  e  =  0.  The  fast  system  reduces 
to  n  dimensions  with  variables  x  as  constant  parameters,  producing 
the  reduced  fast  subsystem, 


x '  =  0 

(10) 

=  /(x,  z) 

(ii) 

On  the  other  hand,  the  order  of  the  slow  system  reduces  to  m 
dimensions  and  results  in  a  set  of  differential-algebraic  equations, 


producing  the  reduced  slow  subsystem, 

X  =  f(x,  z) 

(12) 

0  =  /(x,  z) 

(13) 

The  reduced  slow  system  appears  to  be  a  locally  flattened  vector 
space  of  the  complete  slow  system.  Thus,  the  set  of  points  (x,  z)  e 
Rm  x  Rn  is  expected  to  have  a  Cr  smooth  manifold  M0  of  dimension 
m  inside  the  zero  set  of  function  /(.),  provided  the  functions  f  (.)  and 
/(.)  are  assumed  to  be  Cr . 
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Assumption  1 :  The  functions  f(x,z)  and  /(x,  z)  are  sufficiently 
smooth  so  that  Cr  with  r  >  1 . 

The  requirement  to  be  continuous  and  at  least  once  differentiable 
assures  smoothness  of  the  manifold  A40.  The  flow  on  this  manifold 
evolves  as 

x  =  f[x,  h0(x)]  (14) 

where  h0(x)  is  the  solution  of  the  algebraic  part  [Eq.  (13)]  that 
defines  the  manifold, 

Ad  0:  z  =  h0(x);  x  g  Rm,  zgR"  (15) 

When  viewed  from  the  perspective  of  the  reduced  fast  subsystem, 
the  manifold  Ad0  is  the  set  of  fixed  points  [x,  h0(x)];  therefore,  Ad0  is 
trivially  invariant.  If  every  fixed  point  [x,  h0(x)]  of  the  reduced  fast 
subsystem  is  assumed  to  be  hyperbolic,  then  starting  from  arbitrary 
initial  conditions,  the  flow  will  settle  down  exponentially  fast  onto 
the  manifold,  after  which  the  flow  evolves  according  to  Eq.  (14). 
Equivalently,  the  flow  normal  to  the  manifold  is  faster  than  that 
tangential  to  it.  Such  a  manifold  is  said  to  be  normally  hyperbolic. 
Furthermore,  a  normally  hyperbolic  invariant  manifold  has  local,  Cr 
smooth  stable,  and  unstable  manifolds:  Wfoc(Ad0)  and  (M(}). 
These  manifolds  are  unions  over  all  (x)  in  Ad0  °f  the  local  stable  and 
unstable  manifolds  of  the  reduced  fast  subsystem’s  hyperbolic  fixed 
points  [x,  h0(x)]. 

To  show  these  concepts,  consider  the  following  example.  Let 

x  i  =  —xx  x2  =  —x2  €Z  =  —z  (16) 

so  that  the  reduced  slow  subsystem  is 

x  i  =  — A,  x2  =  —x2  —  z  =  0  (17) 

and  the  reduced  fast  subsystem  is 

x1  =  0  x2  —  0  z'  =  —z  (18) 

The  solution  of  the  algebraic  equation  (17)  is  z  =  0,  which  is  also 
the  fixed  point  of  Eq.  (18).  The  invariant  manifold  is  given  by 
A40:  z  =  0,  which  is  the  complete  XyX2  plane.  The  origin  is  the 
stable  hyperbolic  equilibrium  of  the  reduced  slow  subsystem,  so  any 
trajectory  starting  on  the  manifold  approaches  the  origin  in  forward 
time,  as  seen  in  Fig.  L  Studying  the  reduced  fast  subsystem  suggests 
that,  for  any  point  with  nonzero  initial  condition  z(0),  the  flow 
approaches  normal  to  the  manifold.  Intuitively,  one  may  conclude 
that,  for  initial  conditions  not  on  the  manifold,  the  reduced  fast 
subsystem  describes  the  transition  to  the  manifold,  after  which  the 
system  evolves  according  to  the  reduced  slow  subsystem  (seen  in 
Fig.  2).  Furthermore,  since  all  points  (xl,x2,z)  approach  the 
manifold  at  an  exponential  rate  forward  in  time,  the  complete  space  is 
the  stable  manifold  Ws  (Mo). 

For  the  full-order  system,  similar  inferences  can  be  made.  The 
presence  of  e  in  Eq.  (7)  indicates  that  the  fast  variables  grow  relatively 
faster  than  the  other  states  of  the  system.  If  their  open-loop  system  is 
stabilizing,  these  states  quickly  settle  down  to  their  equilibrium.  The 
other  variables  continue  to  evolve  in  time  with  the  fast  variables  fixed 
by  an  equilibrium  hypersurface.  Mathematically,  3  t*:  t*  >  t0,  after 
which  the  solutions  x(t,  e)  and  z (t,  e)  lie  on  a  distinct  m  dimensional- 
invariant  manifold  Me: 

Me:  z  =  h(x,e);  x  g  !w,  z  e  Rn  (19) 

For  the  system  of  Eqs.  (16),  the  invariant  manifold  continues  to  be 
the  xyx2  plane.  In  addition,  the  family  of  lines  parallel  to  the  z  axes 
still  describe  the  flow  normal  to  the  manifold.  Consider  Fig.  3  to 
study  this  behavior.  To  generate  this  figure,  e  was  chosen  to  be  0.05. 
For  a  fixed  initial  condition,  the  flow  evolves  in  two  parts:  one 
component  along  the  manifold  Me,  which  is  governed  by  the 
reduced  slow  subsystem,  and  the  other  component  in  the  normal 
direction,  for  which  the  flow  is  governed  by  the  reduced  fast 
subsystem.  Points  that  are  already  on  the  manifold  are  seen  to  evolve 
similar  to  the  flow  sketched  in  Fig.  L  Thus,  the  reduced-order  models 


Fig.  1  Reduced  slow  subsystem. 


provide  good  insight  into  the  behavior  of  the  full-order  system.  It  is 
apparent  that  if  the  reduced  fast  subsystem  were  unstable,  then  an 
initial  condition  not  on  the  manifold  would  move  farther  away  in 
time.  For  the  example  considered,  the  manifolds  A40  and  Me  were 
obtained  to  be  identically  equal,  but  this  is  not  generally  the  case  with 
nonlinear  systems. 

The  geometric  constructs  discussed  previously  are  formal 
statements  of  Fenichel’s  persistence  theory  [18].  First,  the  following 
assumptions  about  the  slow  system  are  made: 

Assumption  2:  There  exists  a  set  M0  that  is  contained  in 
{(x,z):  /(x,z)  =  0},  such  that  M0  is  a  compact  boundaryless 
manifold. 

Assumption  3:  M0  is  normally  hyperbolic  relative  to  the  reduced 
fast  subsystem  and,  in  particular,  it  is  required  that  for  all  points 
z  G  A40,  there  are  k  (respectively,  /)  eigenvalues  of  Dzl( 0,  z)  with 
positive  (respectively,  negative)  real  parts  that  are  bounded  away 
from  zero,  where  k  +  l  =  n. 

The  following  theorem  from  Fenichel  [18]  is  for  compact 
boundary  less  manifolds.  Let  the  slow  system  satisfy  Assumptions  1, 
2,  and  3 .  If  €  >  0  is  sufficiently  small,  then  there  exists  a  manifold  Me 
that  is  Cr~]  smooth  locally  invariant  under  the  fast  system  and  Cr~] 
O(e)  close  to  A40.  In  addition,  there  exist  perturbed  local  stable  and 


Fig.  3  Flow  of  system  of  equations  (16)  when  e  ^  0. 
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unstable  manifolds  of  Me,  and  they  are  Cr  0(c)  close,  for  all  r  <  oo,  degree  of  accuracy.  For  functions  0:  Rm  x  I  a  R",  which  are  Cr~x 
to  their  unperturbed  counterparts.  (r  defined  as  in  Assumption  1)  in  the  neighborhood  of  the  origin, 

define 


IV.  Center  Manifold  and  Computation 

Fenichel’s  theorem  [18]  is  a  powerful  tool  to  study  the  behavior  of 
stiff  dynamical  systems.  It  asserts  the  presence  of  an  invariant 
manifold  Me  that  is  G(e)  close  to  M0,  but  it  does  not  provide  the 
procedure  to  compute  the  manifold.  Since  Me  is  invariant  for  some 
t  >  t *,  the  solutions  follow  the  curve  specified  in  Eq.  (19). 
Differentiating  this  expression  with  respect  to  t, 


(Mcf>)(x,  e)  =  e— f[x,  </>(x,  e)]  -  /[x,  </>(x,  <?)]  (26) 

Note  that,  by  Eq.  (21),  (Mh)(x,  e)  =  0. 

The  following  is  the  theorem  from  [25].  Let  0:  Rffl  x  R  a  R” 
satisfy  0(0, 0)  =  0  and  |(M0)(x,  e)|  =  G[C(x,  e)j  for  |x|  -a  0  and 
e^O,  where  C(.)  is  a  polynomial  of  degree  greater  than  one.  Then, 


.  3h  . 
Z  =  9x  X 


(20) 


and  multiplying  Eq.  (20)  with  e  and  substituting  for  x  and  z  from 
Eqs.  (6)  and  (7)  results  in 


e^f[x,  h(x,  e)]  =  l[x,  h(x,  e)] 


(21) 


|h(x,  e)  -  <f>(x,  e)|  =  0[C(x,  6)] 

This  theorem  implies  that  an  approximate  function  0(x,  e)  can  be 
determined  for  small  values  of  x  and  e.  The  condition  0(0, 0)  =  0  is 
to  ensure  that  the  origin  remains  the  fixed  point.  To  demonstrate  the 
procedure,  consider  the  example  from  [25]: 

x  =  xz  +  ax 3  +  bz2x  ez  =  —z  +  cx2  +  dx2z  (27) 


Equation  (21)  is  called  the  manifold  condition.  Note  that 
substituting  e  =  0  in  the  manifold  condition  returns  Eq.  (13),  which 
is  satisfied  by  the  manifold  M0.  To  employ  Fenichel’s  results  [18], 
the  manifold  condition  needs  to  be  solved.  Exact  computation  is 
impossible,  since  solving  this  condition  is  equivalent  to  solving  the 
complete  nonlinear  system.  One  approximate  approach  is  to 
substitute  a  perturbation  expansion  for  h  (x,  e)  =  h0(x)  +  eh^x)  + 
0(e2)  into  Eq.  (21)  and  then  solve  order  by  order  for  h(x,  e).  If  the 
domain  of  interest  is  known,  then  the  implicit  function  theorem  may 
be  employed.  It  is  usually  the  inverse  problem  that  is  encountered, 
which  is  to  find  h(x,  e)  as  a  smooth  function  of  its  arguments.  In  this 
paper,  the  approach  proposed  in  [25]  is  used,  and  the  following 
discusses  its  computation  procedure. 

The  computation  procedure  proposed  in  [25]  has  been  laid  out  for 
dynamical  systems  with  center  manifolds.  For  completeness,  the  first 
step  is  to  check  whether  the  manifold  Me  is  the  center  manifold  of 
the  singularly  perturbed  system.  To  study  this  behavior,  rewrite  the 
fast  system  using  the  technique  called  suspension  [26]  as 


=  ef(x,  z) 

(22) 

€'  =0 

(23) 

=  l(x,  z) 

(24) 

Assume  that  the  origin  is  the  fixed  point  of  the  preceding  system 
that  is  f(0, 0)  =  0  and  1(0,0)  =  0.  Then,  the  perturbed  system 
obtained  by  linearizing  these  equations  about  the  origin  [e  =  0,  x  = 
0,  h(0, 0)  =  0]  is  written  in  compact  form  as 

Aw;  =  f  w  +  FjZ  Az'  =  Lz  +  Lxw  (25) 

where  w  =  [x,  e]T,  Aw,  and  Az  denote  the  perturbation  quantities, 
and  F,FX,L ,  and  L,  are  constant  matrices  of  appropriate  size.  If  all 
eigenvalues  of  F  have  zero  real  parts  while  all  eigenvalues  of  L  have 
negative  real  parts,  then  the  manifold  Me  is  precisely  the  center 
manifold,  and  it  spans  the  generalized  eigenvectors  associated  with 
eigenvalues  with  zero  real  parts.  This  manifold  is  defined  for  all  small 
values  of  the  slow  state  x  and  the  perturbation  parameter  e.  The 
requirement  on  eigenvalues  of  F  supports  the  existence  of  time  scales 
in  the  system,  for  if  the  eigenvalues  were  nonzero,  then  all  states 
would  be  fast  variables,  and  the  system  is  not  singularly  perturbed. 
This  suggests  that  the  eigenvalue  restriction  on  F  is  always  satisfied 
by  systems  with  the  multiple  time- scale  property.  The  other 
requirement  of  negative  eigenvalues  of  L  is  to  ensure  that  the 
trajectories  not  on  the  manifold  approach  it  in  forward  time. 

From  the  preceding  analysis,  h(x,e)  is  known  to  be  the  center 
manifold.  If  the  origin  is  the  fixed  point  of  the  linearized  system,  then 
the  theorem  from  [25]  asserts  that  one  can  approximate  h(x,  e)  to  any 


Linearizing  this  system  about  the  origin, 

Ajc'  =  0  Ac'  =  0  Az'  =  —  l  (28) 

It  is  seen  that  the  system  possesses  a  center  manifold  z  =  h(x,e). 
To  approximate  h,  define 

(M0)(v,  e)  =  €^-[x(/)(x,  e)  +  ax 3  +  b(f)2(x,  6)x]  +  0(v,  e)  —  cx 2 
ox 

—  dx2(j)(x,e)  (29) 

Hence,  if  (j)(x,e)  =  cx2,  then  (M0)(r,e)  =  C9(|v4|  +  |o:4|),  and 
from  the  preceding  theorem,  h(x,  e)  =  cx2  +  (9(|v4|  +  \cxA\).  Since 
the  fast  subsystem  is  stabilizing,  geometric  singular  perturbation 
theory  says  that  stability  of  the  complete  system  can  be  analyzed  by 
studying  the  flow  on  the  manifold  [Eq.  (14)]: 

x  =  (a  +  c)x3  +  bcx5  +  C9(|x5|  +  \cx5\)  (30) 


V.  Control  Formulation  and  Stability  Analysis 

The  central  idea  is  the  following.  If  the  reduced  fast  subsystem  is 
stabilizing  about  the  manifold  A40,  the  complete  system  dynamics 
remain  G(e)  close  to  the  reduced  slow  subsystem.  This  fact  is 
employed  to  develop  a  stable  closed-loop  system.  It  is  proposed  that 
two  separate  stabilizing  controllers  be  designed  for  each  of  the 
subsystems  and  their  composite  be  fed  to  the  complete  system.  It  is 
shown  that,  in  fact,  this  composite  control  uniformly  stabilizes  the 
complete  system.  This  approach  has  been  shown  in  the  literature  to 
guarantee  asymptotic  stability  for  singularly  perturbed  systems  with 
unique  manifolds  A40  [10].  In  the  following  subsections,  control 
laws  for  a  general  class  of  nonlinear  singularly  perturbed  systems  are 
formulated,  and  closed-loop  system  stability  is  analyzed. 

A.  Control  Law  Development 

The  objective  is  to  augment  the  two  time-scale  system  with  state 
feedback  controllers  such  that  the  system  follows  a  specified 
continuous  twice  differentiable  bounded  trajectory  xr(t).  The  first 
step  is  to  transform  the  system  [Eqs.  (4)  and  (5)]  into  a  non- 
autonomous  stabilization  problem.  Define  the  error  signal  as 
x(t)  =  x(t)  —  xr(t)-  Then, 

x  =  f(x  +  xr,  z)  +  g(x  +  xr,  z)u  —  xr  (31) 

ez  =  l(x  +  xr,  z)  +  k(x  +  xr,  z)u  (32) 

The  objective  is  to  seek  the  control  vector  of  the  form  u= 
us  +  uf,  where 
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Uj  -  r,(x,xr,xr)  (33) 

and 

Uf  =  rf(x,z,xr,xr)  (34) 

Substituting  the  controls  into  Eqs.  (31)  and  (32)  produces 

X  =  f(x  +  xr>  z)  +  g(x  +  xr,  z)[r,(i,  xr,  xr) 

+  r/(x,z,xr,xr)]-xr  (35) 

ez  =  /(x  +  xr,  z)  +  k(x  +  xr,  z)[T,(x,  xr,  xr)  +  Tf(x,  z,  xr,  xr)] 

(36) 

Assume  that  the  right-hand  side  of  Eqs.  (35)  and  (36)  is  C2;  that  is, 
the  vector  fields  satisfy  Assumption  1  with  r  =  2.  From  Fenichel’s 
theorem  [18],  it  can  be  concluded  that  there  exists  a  manifold 

M  e:  z  =  h(x,  e,  xr,  xr)  (37) 

that  satisfies  the  manifold  condition, 

ah  ah-  ir_  ir. 

e—  +  e  —  x  =  l[x  +  xr,  h(x,  e,  xr,xr)]  +  k[x 
at  ax 

+  xr,  h(x,  e,  xr,  xr)]r,(x,  xr,  xr )  +  k[x 
+  xr,  h(x,  e,  xr,  xr)]r f[x,  h(x,  e,  xr,  xr),  xr,  xr]  (38) 


With  the  preceding  choice  of  0(x,  xr,  xr,  Ts),  the  exact  manifold  is 
given  as 

h  (x,  e,  xr,  xr)  =  <p(x.  xr,  xr ,  rt)  +  0[C(x,  e,  xr,  xr)] 

Substituting  the  approximate  expression  for  the  manifold  into 
Eqs.  (39)  and  (40), 

x  =  f{x  +  xr,  z  +  <p(x,  xr,  xr,  rs)  +  0[C(x,  e,  xr,  xr)]}  +  g{x 
+  xr,  z  +  c/>(x,  xr,  xr,  T,)  +  <D[C(x,  e,  xr,  xr)]}rs(x,  xr,  xr) 
+  g{x  +  xr,  z  +  0(x,  xr,  xr,  rs) 

+  0[C(x,  e,  xr ,  xr)]}ry(x,  z,  xr,  xr,  r,)  —  xr  (42) 

CL  =  /{x  +  xr,  Z  +  0(x,  xr,  xr,  Ts)  +  (D[C(x,  6,  xr,  xr)]}  +  k{x 
+  xr,  z  +  <p(x,  xr,  xr,  r,)  +  0[C(x,  e,  xr>  xr]})r,(x,  xr,  xr) 

+  k{x  +  xr,  z  +  0(x,  xr,  xr,  Ts) 

+  0\C{x ,  e,  xr ,  xr)]}r >(x,  z,  xr>  xr,  F,) 

9{0  +  e>[C(x,e,xr,xr)]}  d{(p  +  <D[C(x, e,xr,  xr)]}  i 

dt  dx  ’ 

Note  that  is  a  function  of  Ts  due  to  the  choice  of 
0(x,  xr,  xr,  Ts).  The  reduced  slow  and  fast  subsystems  for  the 
system  of  Eqs.  (42)  and  (43)  are  obtained  by  substituting  e  =  0, 
resulting  in  the  reduced  slow  subsystem, 


Note  that  the  manifold  is  time  dependent,  since  the  system  under 
consideration  is  nonautonomous  due  to  the  time  varying  xr(t). 
Define  the  error  between  the  fast  states  and  the  manifold  Me  as 
z  =  z-h(x,e,xr,xr). 

The  transformed  system  with  the  origin  as  the  equilibrium  is 
expressed  as 

x  =  f[x  +  xr,  z  +  h(x,  e,  xr,  xr)]  +  g[x  +  xr,  z 
+  h(x,  e,  xr,  xr)]rs(x,  xr,  xr)  +  g[x  +  xr,  z 
+  h(x,  e,  xr,  xr)]T f[x,  z  +  h(x,  e,  xr,  xr),  xr,  xr]  —  xr  (39) 


CL  —  l[x  +  xr,  Z  +  h(x,  €,  xr,  xr)]  +  k[x  +  Xr,  L 
+  h(x,  e,  xr,  xr)]r,(x,  xr,  xr)  +  k[x  +  xr>  z 
+  h(x,  e,  xr,  xr)]I7[x,  z  +  h(x,  e,  xr,  xr),  xr,  xr] 

ah  ah  • 

—  - 


(40) 


Note  that  the  error  z  =  0  when  the  manifold  condition  is  satisfied. 
It  is  known  that  the  exact  manifold  h(x,  e,  xr,  xr )  is  impossible  to 
compute.  Let  0(x,  xr,  xr,  Ts)  be  an  approximate  manifold  obtained 
using  the  procedure  presented  in  Sec.  IV.  The  approximate  manifold 
is  chosen  to  contain  terms  independent  of  e,  similar  to  the  example 
considered  at  the  end  of  Sec.  IV.  Define 

(M4>)(x,€,Xr,Xr,rs,Tf)=€^+€^X 

—  l[x  +  Xr,  (p(x,  Xr,  Xr,  Ts)] 

—  k[x  +  xr,  (p(x,  xr,  xr,  r.^r.Cx,  xr,  xr) 

—  k(x  +  xr,  0(x,  xr,  xr,  Ts))Tf(x,  0(x,  xr,  xr,  Ts),  xr,  xr )  (41) 


X  =  f[x  +  xr,  z  +  </>(x,  xr,  xr,  r,)]  +  g[x  +  xr,  z 
+  0(x,  Xr>  Xr,  r^jr^x,  Xr,  Xr)  +  g[x  +  Xr,  Z 

+  4>(x,xr,  xr,  r,)]r f(x,z,xr,xr,  rs)  -  xr  (44) 

0  =  /[x  +  Xr,  L  +  0(x,  xr,  xr,  r,)]  +  k[x  +  Xr,  L 
+  4>(x,  xr,  xr,  r.^r.Cx,  xr,  xr)  +  k[x  +  xr,  z 
+  <^(x,  xr,  xr,  r,)]r >(x,  z,  xr,  xr,  rs)  (45) 

and  the  reduced  fast  subsystem, 

x '  =  0  (46) 


l!  =  l[x  +  xr,  z  +  0(x,  xr,  xr,  r,)]  +  k[x  +  xr,  z 
+  (p(x,  xr,  xr,  r,)]rs(x,  xr,  xr)  +  k[x  +  xr,  z 
+  <p(x,  xr,  xr,  r,)]r >(x,  z,  xr,  xr,  rs)  (47) 


In  general,  the  composite  control  approach  first  computes  the 
control  Ts  required  to  maintain  reduced  slow  subsystem  stability  by 
assuming  that  the  fast  states  lie  upon  the  manifold  and  T^  =  0.  In  the 
next  step,  the  control  Tf  is  designed  to  satisfy  two  conditions: 
guarantee  uniform  convergence  of  the  fast  states  onto  the  manifold 
and  remain  inactive  when  the  fast  state  remains  on  the  manifold.  The 
second  condition  is  implemented  to  avoid  affecting  the  conclusions 
drawn  about  the  reduced  slow  subsystem  stability.  In  the  proposed 
control  scheme,  the  second  condition  is  avoided  by  designing  T f 
ahead  of  Ts.  Thus,  design  ry(x,  z,  xr,  xr,  Ts)  as  a  function  of  Ts, 
such  that  Eq.  (47)  is  transformed  into  the  closed-loop  reduced  fast 
subsystem, 


and  let  (M(p)(t,x,  e)  =  G[C(x,  e,  xr,  xr)],  which  depends  on  the 
choice  of  controls  Ts  and  T y.  Furthermore,  the  following  is  assumed: 

Assumption  A:  The  choice  of  controls  Ts  and  Tf  leads  to 
0[C(x,  e  =  0,  xr,  xr)]  =  0. 


z'  =  -Lf(x,z,xr,xr)  +  Kf(z)  (48) 

such  that  —  Lf  (x,  0,  xr,  xr)  +  Kf  (0)  =  0.  With  this  choice  of  Tf  and 
assumptions  about  vector  fields  Lf  and  Kf,  z  =  0  becomes  the 
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unique  root  of  Eq.  (45).  Therefore,  the  reduced  slow  subsystem 
reduces  to 

X  =  f [x  +  Xr,  0(x,  xr,  xr,  r,)] 

+  g[x  +  xr>  4>(x.  xr,  xr>  r^jr^x,  xr,  xr) 

+  g[x  +  xr>  ip(x.  xr,  xr.  r^jr/x,  0,  xr,  xr,  rt)  -  xr  (49) 

The  only  unknown  in  Eq.  (49)  is  Tv;  therefore,  it  may  be  designed  to 
transform  the  reduced  slow  subsystem  into  the  closed-loop  reduced 
slow  subsystem, 

x  = -Fs(x,xr,xr) +Gs(x)  (50) 

and  exact  forms  of  Ty(x,  z,  xr,  xr),  0(x,xr,xr),  and  correspond¬ 
ingly  C(x,  e,  xr,  xr),  can  be  determined  through  Eqs.  (48)  and  (41), 
respectively. 

Remark  3:  In  the  reduced  subsystems  obtained,  z  =  z— 
0(x,  xr,xr)  by  Assumption  A.  Thus,  at  the  implementation  level, 
the  control  is  a  function  of  known  quantities. 

The  complete  closed-loop  system  is  obtained  by  rewriting 
Eqs.  (42)  and  (43)  as 

X  =  f[x  +  xr,  <p(x,  xr,  xr,  rj]  +  g[x 

+  xr,  <p(x,  xr,  xr,  r^jr^x,  xr>  xr)  +  g[x 
+  xr>  4>(x.  xr,  xr,  r^r/x,  o,  xr,  xr,  rt)  -  xr  +  f[x 
+  Xr>  Z  4-  <p(x.  xr,  xr,  r,)]  -  f(x  +  xr,  4>[x,  xr,  xr,  rt)] 

+  {g[x  +  xr,  z  +  0(x,  xr,  xr,  r^)]  -  g[x 
+  Xr,  0(x,  xr,  xr,  rs)]}rs(x,  xr,  xr)  +  g[x  +  xr,  z 
+  <p(x,  xr,  xr,  r,)]r /(x,  z,  xr,  xr,  r,)  -  g[x 
+  xr>  4>(x,  xr,  xr,  or/x,  0,  xr,  xr,  r,)  +  f{x  +  xr,  z 
+  0(x,  Xr,  Xr,  Ts)  +  G[C(x,  €,  Xr,  Xr)]}  —  f[x  +  Xr,  Z 
+  (p(x,  Xr,  Xr,  r,)]  +  (g{x  +  Xr,  Z  +  </>(x,  Xr,  Xr,  rs) 

+  0[C(x,e,  xr,  xr)]}  -  g[x  +  xr,  z, 

+  <p(x,  xr ,  xr,  rj^r^x,  xr,  xr)  +  (g{x  +  xr,  z 
+  0(x,  xr,  xr,  Ts)  +  0[C(x,  6,  xr,  xr)]}  —  g[x  +  xr,  z 
+  <^(x,  xr,  xr,  rjDr/x,  z,  xr,  xr,  r,)  (5i) 

CL  =  l[x  +  xr,  Z  +  0(x,  xr,  xr,  T5)]  +  k[x  +  xr,  z 
+  0(x,xr,x,.,rj)]rj(x,xr,xr)  +  k[x  +  xr,z 
+  0(x,xr,x,.,rJ)]r/(x,z,xr,xr,rJ)  +  /{x  +  xr,z 
+  0(x,  xr,  xr,  rs)  +  0[C(x,  6,  xr,  xr)]}  —  l[x  +  xr,  z 
+  <f>(x,xr,xr,  rs)]  +  (k{x  +  xr,z  +  </>(x,xr,xr,  rs) 

+  C[C(x,  e,  xr,  xr)]}  —  k[x  +  xr,  z 
+  0(x,xr,x,.,rj)])rj(x,xr,xr)  +  (k{x  +  xr,z 
+  0(x,  xr,  xr,  Ts)  +  0[C(x,  e,  xr,  xr)]}  —  k[x  +  xr,  z 
+  0(x,xr,x,.,rJ)])r/(x>z>xr,xr,rJ) 

9{0  +  0[C(x,e,xr,  xr)]}  d{<f>  +  0[C(x,e,xr,xr)]}t  ^ 

3 1  dx  {  ’ 

Using  the  closed-loop  reduced  subsystems  of  Eqs.  (48)  and  (50), 
Eqs.  (51)  and  (52)  become  the  closed-loop  complete  system, 


X  =  — Fs(x,  xr,  xr)  +  Gs(x)  +  f[x  +  xr,  z  +  0(X,  xr,  xr,  T,)] 

-  f[x  +  xr,  (/>(x,  xr,  xr,  rs)]  +  {g[x  +  xr,  z 
+  <p(x,  xr,  xr,  r,)]  -  g[x  +  xr,  <p(x,  xr ,  xr,  r^jjr^x,  xr,  xr) 
+  g[x  +  xr,  z  +  4>(x ,  xr,  xr,  r^jryCx,  z,  xr,  xr,  rs)  -  g[x 
+  xr,  4>(x,  xr,  xr,  rJ)]r/(x,  o,  xr,  xr,  r,)  +  f{x  +  xr>  z 
+  0(x,  xr,  xr,  Ts)  +  (D[C(x,  €,  xr,  xr)]}  —  f[x  +  xr,  z 
+  (p(x,  Xr,  Xr,  r,)]  +  (g{x  +  Xr,  Z  +  (p(x,  Xr,  xr,  rs) 

+  (D[C(x,  e,  xr,  xr)]}  -  g[x  +  xr,  z 
+  <p(x,  Xr,  Xr,  r^r^X,  Xr,  Xr)  +  (g{x  +  Xr,  Z 
+  0(x,  xr,  xr,  T5)  +  0[C(x,  6,  xr,  xr)]}  —  g[x  +  xr,  Z 
+  <Kx,  xr,  xr,  r,)])r f(x,z,xr,  xr,  r,)  (53) 


ei  =  -Lf  (x,  z,  xr,  xr)  +  Kf  (z)  +  l{x  +  xr,  z  +  <p(x ,  xr,  xr,  T,) 

+  0[C(x ,  €,  xr,  xr)]}  -  l[x  +  xr,  Z  +  0(x,  xr,  xr,  TJ] 

+  (k{x  +  xr,  Z  +  0(x,  xr,  xr,  Ts)  +  G[C(x,  €,  xr,  xr)]} 

—  k[x  +  xr,  z  +  0(x,  xr,  xr,  T5)])r^ (x,  xr,  xr)  +  (k{x 
+  xr,  z  +  0(x,  xr,  xr,  Ts )  +  0[C(x,  e,  xr,  xr)]}  —  k[x 
+  xr,  z  +  <p(x,  xr,  xr,  ri)])r/(x,  z,  xr,  xr,  rs) 

a {<t>  +  0[C(x,e,xr,xr)}}  d{<t>  +  0[C(x,e,xr,xr)]}  ~ 

e  dt  dx  ’ 


Remark  4\  If  0(x,  xr,  xr)  is  the  unique  manifold  for  the  complete 
system,  then  the  terms  of  G[C(x,  e,  xr,  xr)]  are  identically  zero,  and 
the  closed-loop  complete  system  in  Eqs.  (53)  and  (54)  takes  the  form 
as  in  [27,28],  which  has  been  proven  to  be  closed-loop  stable. 


B.  Stability  Analysis 

1.  Tracking  Problem 

The  following  theorem  summarizes  the  main  result  of  the  paper. 

Theorem  1 :  Suppose  the  controls  and  are  designed 
according  to  Eqs.  (48)  and  (50)  and  Assumptions  A-I  hold.  Then  for 
all  initial  conditions,  (x,  z)  e  Dx  x  Dz,  the  composite  control  u  = 
us  +  Uf  uniformly  stabilizes  the  nonlinear  singularly  perturbed 
system  in  Eqs.  (4)  and  (5)  for  all  e  <  6*,  where  e*  is  given  by  the 
inequality  equation  (68),  and  the  error  signals  x(t)  and  z(t)  are 
uniformly  bounded  by  Eqs.  (69)  and  (70),  respectively. 

Proof:  Closed-loop  system  stability  is  analyzed  using  the 
composite  Lyapunov  function  approach  [29].  It  is  required  to  prove 
that  the  closed-loop  system  behavior  remains  close  to  the  closed-loop 
reduced  slow  subsystem.  Suppose  that  there  are  Lyapunov  functions 
V(t,  x)  =  \  xTx  and  W(t,  z)  =  ^zrz  for  the  closed-loop  reduced- 
order  models  (50)  and  (48),  respectively,  satisfying  the  following 
eight  assumptions: 

B)  V(t,  x)  is  positive  definite  and  decrescent;  that  is, 

ci||x||2  <  V(t,x)  <  c2||x||2,  x  e  Dx  c  (55) 

C) 


-jA- Fs(x,xr,xr)  +Gs(x)]  <  — «!  ||x  ||2  —  ||x||,  ai>0, 

dx  (56) 

b i  >0 
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D)  There  exists  a  constant  px  >  0,  such  that 
dV 

—  {f [x  +  Xr,Z  +  0(x,Xr,Xr,  rs)]  -  f[x  +  Xr,  4>{x,  Xr,Xr,  r,)]} 
dV 

+  gj-  {g[x  +  xr.  z  +  0(x.  Xr.  T,)]  -  g[x 

dV 

+  Xr,  4>(x,  Xr,  Xr,  r,)]}rj(x,  Xr,  Xr)  +  —  {g[x  +  Xr,  Z 
+  (p(x,  xr,  xr,  r,)]r >(x,  z,  xr,  xr,  rs)  -  g[x 
+  x.^Cx.x^x^r^l^Cx.O.x^x^r,)}  <  ftpIPII  (57) 


derivative  xr,  which  is  known  to  be  bounded  by  the  choice  of  the 
reference  trajectory.  Consider  the  Lyapunov  function  candidate, 

v(t,  x,  z)  =  V(t,  x)  +  W(t,  z)  (63) 


for  the  closed-loop  system  of  Eqs.  (53)  and  (54).  From  the  properties 
of  V  and  W,  it  follows  that  v(t,  x,z)  is  positive  definite  and 
decrescent.  The  derivative  of  v  along  the  trajectories  of  Eqs.  (53)  and 
(54)  is  given  by 


dV  •  1  3W„, 

:9ix+^r 


(64) 


E)  There  exist  constants  p2  >  0,  /J3  >  0,  and  p4  >  0,  such  that 
dV 

—  (f{x  +  xr,  z  +  0(x,  xr ,  xr ,  TJ  +  (9[C(x,  e,  xr ,  xr)]}  -  f[x 
ox 

dV 

+  xr,z  +  4>(x,  xr,  xr,  r,)])  +  —  (g{x  +  xr,  z 

+  <p(x,  xr,  xr,  r,)  +  <D[C(x ,  e,  xn  xr)]}  —  g[x  +  xr,  z 

dV 

+  <p(x,  xr,  xr,  r^r^x,  xr>  xr)  +  —  (g{x  +  xr>  z 
+  0(x,  xr,  xr,  Ts)  +  G[C(x,  e,  xr,  xr)]}  —  g[x  +  xr,  z 
+  0(x,  xr,  xr,  rs)])I7(x,  z,  xr,  xr,  r,)  <  e/S2||x||2 
+  ^3l|x||||z|| +eAI|x||  (58) 

F)  W(t,  z)  is  positive  definite  and  decrescent  scalar  function 
satisfying, 


Substituting  Assumptions  B-I  into  Eq.  (64), 

V  <  — «i||x||2  —  £>! || x ||  +/Ji||x||||2||  +eAI|x||2  +  e^3||x||||z|| 

+  eAI|x|| 

-^||z||2  +  ft||z||2  +  ^||x||||z||  +^7||z||  +PM\ 

+  A II X  IIP  II  (65) 

Collecting  like  terms 

i>  <  -(«i  -  eA)l|x||2  -  (bl  -  e/34)||x||  +  (0,  +  efi3  + 

+  A)l|x||||z|| 

-(y-ft)"zll2-(-^7-^8)PII  (66) 

Rearrange  Eq.  (66)  to  get 


c3||z||2  <  W(t,z)  <  c4||z||2,  z  G  Dz  C  Rn  (59) 


G) 

^[-Lf(x,z,xr,xr)  +  Kf(z)]  < -a2||z||2,  a2  >  0  (60) 

d  z 


H)  There  exist  scalars  P5  >  0,  P6  >  0,  and  P7  >  0,  such  that 


dW 

— -  (l{x  +  xr,  z  +  0(x,  xr,  xr,  Ts)  +  0[C(x,  e,  xr,  xr)]}  -  l[x 
dz 

dW 

+  xr,z  +  <p(x ,  xr,  xr,  r,)])  +  —  (k{x  +  xr,  z 

dz 

+  0(x,  xr,  xr,  T5)  +  0[C(x,  e,  xr,  xr)]}  —  k[x  +  xr,  z 

dW 

+  <p(x,  xr,  xr,  rs)])r,(x,  xr,  xr)  +  —  (k{x  +  xr,  z 
+  0(x,  xr,  xr,  Ts)  +  0[C(x,  e,  xr,  xr)]}  —  k[x  +  xr,  z 
+  0(x,xr,xr,rJ)])r/(x,z,xr,xr,rJ)  <  e&p||2 
+  eAI|x||  ||z||  +  e/M|z||  (61) 


I)  There  exist  constants  /38  —  0  and  y^9  >  0,  such  that 
dW  f  9{0  +  0[C(x,  e,  xr,  xr)]} 


dz 


dt 


,  3 {(/>  +  0[C(x,e,xr,xr)]} 

+  35 


e^sPII +eAI|x||p||  (62) 


Remark  5:  Assumptions  B,  C,  F,  and  G  are  conditions  for 
asymptotic  stability  of  closed-loop  reduced-order  models.  The 
constant  bx  in  Assumption  C  depends  upon  the  bounds  of  the 
specified  trajectory  xr{t )  and  its  derivative  xr.  If  the  control  Ts  is 
designed  to  maintain  asymptotic  stability  of  the  closed-loop  slow 
subsystem,  then  =  0.  Additionally,  Assumptions  D,  E,  H,  and  I 
are  interconnection  conditions  obtained  by  assuming  the  vector  fields 
are  locally  Lipschitz.  The  constants  /34,  /37,  and  /38  appear  due  to  the 
time- varying  nature  of  the  manifold  and  depend  upon  the  bounds  of 
xr(t )  and  its  derivative  xr.  The  constant  /3S  also  depends  upon  the 


v  < 


— (1  —  d)(oii  —  eP2)  2(^i  +  +  ^6  +  P9) 

_PI|J  i(/s1+6/s3  +  /s6  +  /s9) 

-  ||x||{d(Q!i  -ej62)l|x||  -(ej04-fci)} 


'W?-9 


--ft  |i|-(A  +  « 


)'  ° 


<  d  <  1 


(67) 


The  matrix  becomes  negative  definite  when 


(1  —  d)2(ax  —  efd2)  ^7  —  @5^  <  ^  (Pi  +  +  p6  +  P9)2  (68) 

Thus,  there  exists  an  upper  bound  6*  and  upper  bounds  on  the  errors 
xb  and  zb , 


1 


6/34  —  bx 
d(ax  -  ep2) 

_  Pi  +  Ps 
~d(^-p5) 


(69) 


(70) 


for  which 


v  <  0 


(71) 


From  the  Lyapunov  theorem,  it  can  then  be  concluded  that  the 
closed-loop  signals  x  and  z  are  uniformly  bounded  for  all  initial 
conditions  (x,  z)  e  Dx  x  Dz.  Consequently,  the  control  vector  u  = 
us  +  Uf  is  bounded.  Furthermore,  since  the  trajectory  xr(t )  is 
bounded,  the  manifold  h(x,z,xr,xr)  and  the  closed-loop  signals 
x(t)  and  z (t)  are  bounded.  □ 

2.  Special  Case:  Regulation  Problem 

The  following  theorem  summarizes  the  main  result  for  the 
stabilization  problem. 
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Theorem  2:  Suppose  the  controls  us  and  are  designed 
according  to  Eqs.  (48-50),  and  Assumptions  A-I  hold  with  x  =  x 
and  z  =  z.  Then,  for  all  initial  conditions  (x,  z)  e  Dx  x  Dz,  the 
composite  control  u  =  asymptotically  stabilizes  the 

nonlinear  singularly  perturbed  systems  in  Eqs.  (4)  and  (5)  for  all 
e  <  e*,  where  e*  is  given  by  the  inequality  equation  (68)  with  d  =  0. 

Proof.  Note  that,  in  this  case,  the  manifold  h(x,e)  is  not  time 
varying,  and  x  =  x  and  z  =  z.  Since  this  problem  is  autonomous,  the 
decrescent  conditions  on  the  Lyapunov  functions  V  and  W  can  be 
relaxed.  The  constants  f4,  /J7,  and  /38  in  Assumptions  E,  H,  and  I  are 
all  equal  to  zero,  and  the  constant  bl  =  0,  since  xr  =  0  and  xr  =  0. 
With  these  modifications,  Eq.  (67)  is  modified  as 


x||_r  —(o'!— 6^2)  |(^i  +  +  ^6  +  P9) 

4]  5^1+^3+A+A) 


Therefore,  there  exists  an  e*  such  that 

v  <  0  (73) 

where  e*  satisfies  the  inequality  equation  (68)  with  d  =  0.  □ 

Remark  6:  Theorems  1  and  2  depend  upon  the  approximation  of 
the  invariant  manifold,  leading  to  local  results.  If  it  were  possible  to 
obtain  the  expression  of  the  exact  manifold,  these  results  would  be 
valid  globally. 

Remark  7:  Fenichel’s  theorem  [18]  implies  that  the  behavior  of  the 
complete  nonlinear  system  remains  close  to  the  reduced  slow 
subsystem  if  the  reduced  fast  subsystem  is  stable.  Theorems  1  and  2 
state  the  same  result  for  the  closed-loop  singularly  perturbed  system. 

VI.  Numerical  Examples 

A.  Purpose  and  Scope 

The  preceding  theoretical  developments  are  demonstrated  with 
simulation.  The  first  example  is  a  generic  planar  nonlinear  system. 
This  planar  example  enables  the  study  of  the  geometric  constructs, 
which  are  generally  difficult  to  visualize  in  higher-dimension 
problems.  A  step-by-step  procedure  of  controller  development  is 
detailed  for  the  system  to  track  a  desired  slow  kinetic  state.  A 
comparison  between  the  manifold  approximation  and  the  attained 
actual  fast  state  is  made.  The  closed-loop  results  are  studied  for  a 
sinusoidal  time- varying  trajectory  and  the  regulator  problem.  The 
second  example  develops  control  laws  for  a  nonlinear  F/A-18A 
Hornet  model.  The  objective  of  this  example  is  to  test  the 
performance  of  the  controller  for  a  highly  nonlinear,  two  time-scale 
system.  It  is  required  to  perform  a  turning  maneuver  while 
maintaining  zero  sideslip  and  tracking  a  specified  angle-of-attack 
profile. 

B.  Generic  Two-Degree-of-Freedom  Nonlinear  Kinetic  Model 

The  fast  dynamics  are  modified  to  include  an  arbitrarily  chosen 
quadratic  nonlinearity  in  the  fast  state,  and  a  pseudocontrol  term  with 
unit  effectiveness  is  introduced.  For  this  example,  let  and  z  e  R 
represent  the  slow  and  the  fast  states,  respectively.  The  control  u  e  R 
is  developed  to  track  a  desired  smooth  trajectory  xr{t ): 


x  —  — x  -f  (v  -f  0.5)z  ~\~  u 

(74) 

€Z  =  X  —  (X  +  l)z  +  Z2  +  U 

(75) 

The  value  e  =  0.2  is  retained  in  the  modified  model  [26]. 

1.  Controller  Design 

Define  the  errors  x  =  x  —  xr  and  z  =  z  —  h(x,  e,  xr,  xr ),  and 
transform  Eqs.  (74)  and  (75)  into  error  coordinates  equivalent  to 
Eqs.  (39)  and  (40): 


x  =  —(x  +  xr)  +  (x  +  +  0.5)[z  +  h(x,  e,  xr ,  xr)] 

-  xr  +  Ts  +  Tf  (76) 

ez  =  (x  +  xr)  -  (x  +  vr  +  l)[z  +  h(x,  e,  xr,  xr)\ 

+  [z  +  h(x,  e,  xr,  xr)f  +  Ts  +  Tf  (77) 

Rearrange  Eqs.  (76)  and  (77),  dropping  arguments  of  h : 

x  =  —  x  +  xh(.)  +  0.5/*(.)  —  xr  +  xrh(.)  +  (x  +  xr  +  0.5)z  —  xr 

+  r,  +  Tf  (78) 

ez  =  ~{x  +  vr  +  l)z  +  z2  +  2 zh(.)  +  x  +  -  (x  +  vr  +  1  )h(.) 

dh  dh  • 

+  M.)2  +  r,  +  r/-6--6-v  (79) 

Comparing  with  Eqs.  (39)  and  (40), 
f(.)  =  —  x  +  xh(.)  +  0.5  h(.)  —  xr  +  xrh(.)  +  (x  +  xr  +  0.5)z 
g(-)  =  1 

/(.)  =  —(x  +  +  l)z  +  z2  +  2 zh(.)  +  X  +  xr  -  {x  +  xr  +  1  )hf) 

+  K)2 

k(.)  =  1  (80) 

Let  (p(x,xr,xr,Ts )  be  the  approximate  manifold.  Define  the 
manifold  condition: 

d(p  ()(p 

(M<p)(x,xr,xr,  rs,T f)  =  €~k~  +  e^zX  -  x  -  xr  +  (x  +  Vr)0(.) 

J  dt  ox 

+  0(.)  -  0(.)2  -Ts-rf  (81) 

Select 

(p(x,  xr,  xr,  Ts)  =  x  +  xr  +  Ts  (82) 

so  that 

(M(j))(x,xr,xr,  Ts)  =  +  e^x  +  (x  +  xr)(x  +  +  Ts) 

at  ox 

-  0(.)2  -  Tf  (83) 

and 

O  [C{x,  e,  xr,  xr)]  =  ( M(p)(x ,  xr,  xr,  Ts) 

To  design  the  control  Tf,  develop  the  reduced  fast  subsystem 
equivalent  to  Eqs.  (46)  and  (47): 

x'  =  0 

z'  =  —(x  +  xr  +  l)z  +  Z2  +  2z0(.)  +  X  +  xr  —  (v  +  Xr  +  1  )0(.) 

+  0(.)2  +  r,  +  Tf  (84) 

Design 

rf  =  -Afz-  2 Z0(.)  +  (x  +  +  1  )0(.)  -  X  -  xr  -  (j)2  -  Ts  (85) 

where  Af  is  a  feedback  gain.  Then,  the  closed-loop  reduced  fast 
subsystem  becomes 

z '  —  —  (x  +  xr  +  1  +  Af)z  +  z2 

Comparing  with  Eq.  (48), 


(86) 
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L  y(.)  —  (x  +  Xr  +  1  +  Ay)2  Ky(.)  —  ^  (87) 

The  next  step  is  to  determine  the  control  Ts.  Develop  the  reduced- 
order  slow  subsystem  equivalent  to  Eqs.  (44)  and  (45): 

x  =  — x  T-  x0(.)  T-  O.50(.)  —  xr  T-  xr0(.)  T-  (x  T-  xr  T-  0.5)2  —  xr 


+  r,  +  Ty  (88) 

0  =  —  (x  -\-  xr  -\-  1)2  +  22  +  2 20(.)  +  x  +  xr  —  (x  +  xr  +  1)0(.) 

+  0(.)2  +  r,  + 17  (89) 

Substituting  for  Tf  from  Eq.  (85)  in  Eqs.  (88)  and  (89), 

x  =  -2x  +  x0(.)  +  O.50(.)  -  2xr  +  xr0(.)  +  (x  +  xr  +  0.5)2 
—  xr  —  0(.)2  —  220  (.)  +  (x  +  xr  +  1)0(.)  —  Ay2  (90) 

0  =  -(x  +  xr  +  1  +  Ay)  2  +  22  (91) 


Since  2  =  0  is  the  root  of  the  algebraic  solution,  the  reduced  slow 
subsystem  is  obtained  as 

x  =  —2x  +  V0(.)  +  O.50(.)  ~  2xr  +  xr(j){.)  —xr  —  0(.)2 

+  (x  +  vr  +  1)0(.)  (92) 

Substituting  the  expression  for  0(.)  from  Eq.  (82)  in  Eq.  (92), 
x  =  —2x  —  2  xr  —  xr  +  (2x  +  1.5  +  2xr)(x  +  xr  +  Tj) 

-(x  +  ^  +  r,)2  (93) 

Design  T5  as 

Ts  =  —  x  —  xr  +  xr  —  Ax  (94) 

where  A  is  the  feedback  gain.  Thus,  the  resulting  closed-loop  reduced 
slow  subsystem  is 

x  =  —(2  —  2  xr  +  2Axr  +  1.5A  —  2Axr)x  +  (—A2  —  2A)x2 
+  (— 2xr  +  0.5ir  +  2  XyXy  —  i:2)  (95) 

where  A  is  the  feedback  gain.  Comparing  Eq.  (95)  with  Eq.  (50), 


Fs(.)  =  (2  —  2xr  +  2Axr  +  1.5A  —  2Axr)x  —  (— 2xr  +  0.5xr 
+  2xrxr  —  x2) 

G.(.)  =  (—A2  -  2A)x2  (96) 

Note  that  this  control  only  guarantees  bounded  tracking  for  the 
slow  subsystem.  To  implement  the  control  laws,  substitute  for  Ts 
from  Eq.  (94)  into  Eq.  (82): 

0  =  xr  —  Ax  (97) 

and  use  Eqs.  (85)  and  (97): 

T y  =  (—A2  —  A)x2  +  x(xr  +  2Axr  —  Axr)  +  2Ax  2  —  2(2xr  +  Ay) 

—  x2  +  xrxr  (98) 

Recall  that  these  controllers  are  designed  using  the  reduced- 
order  subsystems  2  =  z  —  0(.)>  where  0(.)  is  given  by  Eq.  (97). 
Then,  the  control  laws  Ts  and  Ty  can  be  expressed  in  original 
coordinates  as 

Ts  =  —  x  +  xr  —  A(x  —  xr)  (99) 

T y  =  (—A2  —  A)(x  —  xr)2  +  (x  —  xr)(xr  +  2Axr  —  Axr) 

+  2A(x  -  Xy)[z  -  0(.)]  -  [z  -  0(.)](2 Xy  +  Ay)  -  X2  +  xrxr  (100) 

Using  the  manifold  condition  equation  (83), 

(M(t>)(x,xr,xr,  rs)  =€^  +  e^{-¥s(x,xr,xr)  +  Gs(x)}  (101) 

Thus,  by  the  choice  of  controls  Ts  and  Ty,  0[C(e= 

0,  x,xr,xr)]  =  0. 

2.  Results  and  Discussion 

Case  1A :  Controller  performance  for  tracking  a  continuously  time- 
varying  sine  wave  of  0.2sin(0.2^)  is  presented  in  Fig.  4.  The 
feedback  gains  chosen  are  A  =  3  and  Ay  =  1.  The  domains  of  the 
errors  are  Dx  =  [—  0.3  0.3]  and  Dz  =  [  — 1.5  1.5].  Several 

constants  in  Assumptions  B-I  are  computed  as  a1  =  1,  bx  =  0.26, 
Pi  =  1.4,  p2  —  30,  p3  =  0,  =  0.686,  a2  =  l,  P5  =  1-96, 

P6  =  250,  p7  =  0.5096,  ps  =  3.778,  and  p9  =  250.  These  values 
and  a  choice  of  d  =  0.3  results  in  e*  =2000  ^  1.  From  the 
simulation  results,  it  is  seen  that  the  system  response  is  bounded  for 
all  time.  Additionally,  for  simulations  with  e  =  0.2,  the  bounds 


Time($) 


Fig.  4  Case  1A:  kinetic  slow  state  compared  with  specified  sine- wave  reference,  and  fast  state  compared  with  manifold  approximation  and  computed 
control. 
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Fig.  5  Case  IB:  kinetic  slow  state,  fast  state  compared  with  manifold  approximation  and  computed  control  (regulator  problem). 


Fig.  6  F/A-18A  Hornet  external  physical  characteristics. 


xb  =  0.0818  and  zb  =  4.701,  and  the  control  is  bounded  for  all  time. 
The  initial  overshoot  may  be  avoided  by  adding  actuator  dynamics 
and  adjusting  the  feedback  gains.  Note  that  the  fast  state  response 
remains  close  to  its  approximation  (p(t,  x). 

Case  IB:  This  case  simulates  the  regulator  problem  with  xr  =  0 
and  xr(t )  =  0.  The  control  laws  are  the  same  as  derived  in  Eqs.  (99) 
and  (100).  The  constants  bx  =  0,  /34  =  0,  /J7  =  0,  and  /3S  =  0,  while 
the  other  constants  have  the  same  values  as  in  Case  1A.  With  the 
choice  of  d  =  0,  e*  =  1000  1.  The  results  are  presented  in  Fig.  5, 

which  shows  that  the  system  asymptotically  settles  down  to  the 
origin. 

C.  Lateral/Directional  Maneuver  for  F/A-18A  Hornet  Aircraft 

The  complete  nonlinear  dynamic  model  in  the  stability  axes  is 
represented  by  the  nine  states  (M,  a,  (3,  p,  q,  r,  0,  9,  0)  and  four 
controls  (rj,  8 e,8a,  8r).  For  this  example,  [M,  a,  0,  0 ,  0]r  comprise 
the  slow  states,  and  the  angular  rates  [p,  q,  r]T  comprise  the  fast 
states.  The  aerodynamic  database  for  the  symmetric  F/A-l  8 A  Hornet 
(seen  in  Fig.  6)  is  used  [30].  The  aerodynamic  coefficients  are  given 
as  analytical  functions  of  the  sideslip  angle,  angle  of  attack,  angular 
rates,  and  the  control  surface  deflections.  Considering  the  number  of 
controls  available,  only  three  of  the  six  slow  states  can  be  controlled. 
Throttle  is  maintained  constant  at  r]  =  0.523  and  is  not  used  as  a 
control.  This  is  a  result  of  using  dynamic  inversion  [31].  The  control 
objective  is  to  perform  a  45  deg  turn  at  or  near  zero  sideslip  angle 
while  tracking  a  specified  angle-of-attack  profile.  Pitch  attitude  angle 
0  and  bank  angle  0  are  left  uncontrolled. 

1.  Controller  Design 

The  control  laws  are  developed  according  to  the  theory  developed 
in  the  previous  sections.  For  brevity,  only  the  equations  required  for 


incorporating  the  control  law  in  the  simulation  are  presented  here. 
Since  the  aircraft  equations  of  motion  are  highly  coupled,  the  first 
step  is  to  transform  them  into  slow  and  fast  sets.  Let  x  =  [a,  j$,  0]r 
represent  the  subset  of  the  slow  states  and  u  =  [8e,  8a,  8r]T  represent 
the  control  variables, 

x  =  f1  (x,  M,  6, 0)  +  f2(x,  0,0) z  +  g(x,M)u  (102) 

f(.) 

ez  =  l(  z)  +  /(x,  M)  +  /(x,  M)z321  +  k(x,  M)  u  (103) 

K) 


The  parameter  e  is  introduced  on  the  left-hand  side  of  Eq.  (103)  to 
indicate  the  time-scale  difference  between  body-axis  angular  rates 
and  the  other  states  [14].  In  the  translational  equations  of  motion, 
functions  such  as  gravitational  forces  and  aerodynamic  forces  due  to 
angle  of  attack  and  sideslip  angle  are  collectively  represented  as 
fj(x,  M,  0, 0).  Terms  in  the  translational  equations  of  motion  due  to 
the  cross  products  between  the  angular  rates  and  the  slow  states  are 
labeled  f2(x,  0, 0)z.  The  remaining  terms  in  the  slow  state  equations 
are  the  control  effectiveness  terms  labeled  g(x,  M).  The  nonlinearity 
in  the  fast  dynamics  due  to  the  cross  product  between  the  angular 
rates  is  represented  by  /  (z)  1.  The  aerodynamic  moment  terms  that 
depend  solely  upon  the  slow  state  are  denoted  as  /(x,  M) 2,  and  the 
aerodynamic  moment  terms  that  depend  linearly  on  the  angular  rates 
are  denoted  as  /(x)3.  The  term  k(x,  M)  is  the  control  effectiveness 
term  in  the  angular  rate  dynamics.  The  exact  form  of  these  functions 
is  derived  in  the  Appendix.  Define  the  errors  x  =  x  —  xr  and  z  = 
z  —  h(x,  e,  xr,  xr,  M)  and  transform  Eqs.  (102)  and  (103)  into  error 
coordinates  equivalent  to  Eqs.  (39)  and  (40): 
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x  =  f  j (x  +  xr,  M,  0,  0)  +  f2(x  +  xr,  0,  0)[z  +  h(.)] 

^  “V 

f(-) 

+  g(x  +  x,M)[rs  +  rf]  -  xr  (104) 

cz  =  /[ z  +  h(.)]  +  /(x  +  xr,  M)  +  /(x  +  xr,  M)[z  +  h(.)]321 

/(.) 

+  k(i  +  xr,M)[rs  +  (105) 

Note  that,  for  an  aircraft  example,  the  manifold  will  also  be  a 
function  of  Mach  number.  Let  d>(x,  xr,  xr,  Ts,  M)  be  the  approx¬ 
imate  manifold.  Then,  Eq.  (38)  expresses  the  manifold  condition.  In 
this  case,  select 

<J>(x,xr,xr,Ts)  =  — /(x  +  xr,M)[l(x  +  xr,M) 

+  k(x  +  xr,  M)rs]23  -  1  (106) 

such  that 

0<f>  0$  .  0$  . 

(M$)(x,  xr>  xr>  rs)  =  6— +  e— x  +  e^M  -  /[*(.)] 

—  k(x  +  xr,  M)Tf  1  (107) 

To  design  T y,  develop  the  reduced  fast  subsystem, 

x'  =  0  (108) 

l!  =  l[z  +  0(.)]  +  /(x  +  xr,M)  +  /(x  +  xr,M)[z  +  $(.)] 

+  k(x  +  xr,  M)[TS  +  Tf]321  (109) 

Using  dynamic  inversion  and  Eq.  (106),  design 

T  f  =  k_1(x  +  xr,M){—Afz  —  l[ z  +  d>(.)]  -  l(x  +  xr,M) z}31 

(110) 

where  Af  is  the  chosen  feedback  gain.  Then,  the  closed-loop  reduced 
subsystem  becomes 

z'  =  —Afz  (111) 

Comparing  with  Eq.  (48), 

Lf(.)  =Afi  (112) 


K  f  (.)  =  0  (113) 

Similarly,  develop  the  reduced-order  slow  subsystem, 

x  =  fj(x  +  xr,  M,  0, 0)  -  f2(x  +  xr,  6, 0)/(x  +  xr,  M)/(x 

+  xr,M)  —  g(x  +  xr,M)k_1(x  +  xr,M)/(d>)  —  xr 

+  [— f2(x  +  xr,  0, 0)/(x  +  xr,  M)_1k(x  +  xr,  M ) 

+  g(x  +  xr,  M)]T, 3123  -  1  (1 14) 

Then,  the  control  law  for  the  reduced  slow  subsystem  is  computed 
as 

Ts  =  B-1{—  Ax  +  xr}  +  B_1{— fx(x  +  xr,  M,  0, 0) 

+  f2(x  +  xr,  0,(p)l(x  +  xr,M)l(x  +  xr,M)}23  -  1  (115) 

where 

B  =  [— f2(x  +  xr,  6, 0)/(x  +  xr,  M)~l k(x  +  xr,  M ) 

+  g(x  +  xr,M)\ 3 

A  is  the  feedback  gain,  and  the  resulting  closed-loop  system  is 

x  =  —Ax  —  g(x  +  xr,M)k_1(x  +  xr,M)/[0>(.)]l  (116) 

where  d>(.)  is  obtained  from  Eq.  (106).  Note  by  the  choice  of  Tf, 
Eq.  (107)  becomes 

0<j>  00  00  • 

(M$)(x,xr,xr,rs)  =£—  +  €  — x  +  e—M  (117) 

and  thus  0[C(e  =  0,x,xr,  xr)]  =  0.  Furthermore,  since  the  aero¬ 
dynamic  moments  are  a  function  of  the  angular  rates,  matrix  Z(x  + 
xr,  M) 3  is  full  rank.  The  control  effectiveness  terms  k(x  +  xr,  M) 
represent  the  aerodynamic  moment  coefficients  due  to  control 
effector  deflections,  which  are  nonzero. 

Remark  8:  The  aircraft  example  assumes  that  the  Mach  number, 
pitch  attitude  angle,  and  bank  angle  are  input  stabilizable.  Although 
the  angular  rates  are  bounded  by  the  reference  trajectory,  the  Euler 
angles  remain  bounded  through  the  exact  kinematic  relationships. 
Additionally,  since  the  angle  of  attack  is  being  tracked  and  thrust 
remains  constant,  the  Mach  number  remains  bounded. 

2.  Results  and  Discussion 

Case  2:  The  specified  maneuver  is  a  45  deg  turn  at  or  near  zero 
sideslip  angle  while  simultaneously  tracking  a  step  input  in  the  angle  of 


Tlme(s) 

- system  response  1 

- reference 

Time(s) 


Fig.  7  Case  2:  F/A-18A  Mach  number,  angle  of  attack,  and  sideslip  angle  responses;  0.3/20k. 


SIDDARTH  AND  VALASEK 


747 


attack.  The  flight  condition  is  Mach  0.3  at  20,000  ft  altitude  (0.3/ 20k). 
The  trim  and  initial  conditions  are  a(l)  =  2  deg,  p( 0)  =  4  deg/s, 
g(0)  =  —2  deg  / s,  and  r(0)  =  2  deg  / s.  The  feedback  gain  matrices 
are 


"1 

0 

0 

"5 

0 

0 

0 

1 

0 

’  Af  - 

0 

5 

0 

0 

0 

1 

0 

0 

5 

Theorem  1  guarantees  the  existence  of  the  bound  €*,  but  the 
nonlinearity  of  this  example  restricts  its  analytical  computation.  Note 
also  that,  for  an  aircraft,  the  parameter  e  is  normally  only  introduced 
in  the  modeling  stage  to  take  advantage  of  the  presence  of  different 
time  scales  in  the  system.  In  reality,  this  parameter  is  a  function  of  the 
flight  condition  and  is  difficult  to  quantify.  Thus,  it  is  advantageous  to 
derive  and  implement  controllers  that  do  not  require  knowledge  of 
this  parameter. 

Figures  7-10  evaluate  control  law  performance  for  the  specified 
maneuver.  After  initial  transients  settle  out,  the  angle  of  attack, 
sideslip  angle,  and  heading  angle  states  closely  track  the  reference. 
The  angle-of-attack  error  is  within  ±0.2  deg,  and  the  sideslip  angle 
tracking  error  is  within  ±0.2  deg  throughout  the  maneuver.  The 
heading  angle  is  maintained  within  ±0.25  deg.  Close  tracking  of  the 
slow  states  implies  that  the  fast  states  are  successfully  being  driven 
onto  the  approximate  manifold,  as  is  seen  in  Fig.  9.  The  angular  rates 
are  smooth,  and  errors  are  within  ±2  deg  /  s.  The  control  surface 
deflections  are  within  bounds  and  generate  the  desired  nonzero 
angular  rates.  The  uncontrolled  states  M,9,  and  0  are  well  behaved 
and  remain  bounded  throughout  the  maneuver. 


a  =  q - -{(/?  cos  a  ±  r  sin  a)  sin  P) 

cos  p 


± 


os  p  [mvsl 


cos P  \mvsM 
—  mg  cos  [i  cos  y 


Tmr]  sin  a  ±  -CL(a,  q,  8e)pv2M2S 


■]} 


(A2) 


P  = 
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—Tmr]  cos  a  sin  p  ±  -CY(P,  p,  r,  8e,  8a,  8r)pv2M2S 


±  mg  sin  /x  cos  y 


±  (p  sin  a  —  r  cos  a) 


(A3) 


-  — - -  qr  ±  pv2M2SbCi(P,  p,  r,  8e,  8a,  8r)  (A4) 

f  r  r 


q  =  —j  —pr  +  YfPV2sM2ScCm(a,  q,  8e)  (A5) 

*  v  v 


r  =  — - -pq  ±  - —  pv2M2SbCn(P,  p,  r,  8e,  8a,  8r)  (A6) 

1  7  £1  ~ 


0  —  p  ±  q  sin  0  tan  9  ±  r  cos  0  tan  9  (A7) 


VII.  Conclusions 

A  control  formulation  for  tracking  the  slow  states  of  a  general  class 
of  nonlinear  singularly  perturbed  systems  was  developed  based  upon 
the  study  of  its  geometric  constructs.  For  a  given  set  of  nonlinear 
algebraic  equations,  an  approximate  analytical  form  of  the  system 
manifold  was  computed.  Control  laws  for  each  of  the  subsystems  and 
boundedness  of  closed-loop  signals  was  demonstrated  with  a 
composite  Lyapunov  function  approach,  and  asymptotic  stabiliza¬ 
tion  was  shown  for  the  general  class  of  nonlinear  singularly  perturbed 
systems.  Controller  performance  was  demonstrated  through  numer¬ 
ical  simulation  for  two  nonlinear  examples. 

Based  upon  the  results  presented  in  the  paper,  tracking  error  for 
the  nonlinear  planar  example  was  demonstrated  to  remain  within 
1 0.08 1  at  all  times,  as  predicted  by  the  analytically  computed  bound. 
It  was  also  shown  that,  for  all  values  of  e,  the  controller  maintains 
bounded  stability  and  the  asymptotic  convergence  of  the  errors  to 
origin  for  the  regulator  problem.  Nonlinear  simulations  of  an  F/A- 
18A  Hornet  demonstrate  that  the  controller  is  capable  of  closely 
tracking  heading,  sideslip  angle,  and  angle  of  attack.  The  angular 
rates  were  within  bounds  and  seen  to  track  the  desired  manifold 
approximations  well.  Even  though  the  Mach  number,  bank  angle, 
and  pitch  attitude  angle  were  not  controlled,  their  magnitudes 
remained  bounded  as  expected.  The  aircraft  example  demonstrates 
the  advantage  of  developing  controllers  independent  of  the  scalar 
perturbation  parameter  e. 


Appendix 

The  nonlinear  mathematical  model  of  the  aircraft  is  represented  by 
the  following  dynamic  and  kinematic  equations: 

Tmr]  cos  a  cos  P  —  ^CD(a,  q,  8e)pv2M2S  —  mg  siny 

(Al) 


9  =  q  cos  0  —  r  sin  0 

(A8) 

0  =  (q  sin  0  ±  r  cos  0)  sec  9 

(A9) 

Wind-axes  orientation  angles  /x  and  y  are  defined  as  follows: 
sin  y  =  cos  a  cos  p  sin  9  —  sin  p  sin  0  cos  0 
—  sin  a  cos  P  cos  0  cos  9  (A10) 

sin  /x  cos  y  =  sin  9  cos  a  sin  P  ±  sin  0  cos  9  cos  P 
—  sin oi  sin  P  cos  0 cos  9  (All) 


cos  /x  cos  y  =  sin  9  sin  a  ±  cos  a  cos  0  cos  9  (A12) 


To  write  the  equations  in  the  form  of  Eqs.  (102)  and  (103), 


f  j(x,M,  9, 0) 


-mv,M  cosi  k  CL  {ct)pv2sM2S-  mg  cos  II  cosy 


\  CY(P)pv2sM2S  ±  mg  sin  pi  cos  y 
0 


(A13) 


f20,6>,  0) 


—  cos  a  tan  P 
sin  a 
0 


1  —  sin  a  tan  P 

0  —  cos  a 

sec  9  sin  0  cos  0  sec  9 

(A  14) 
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g(x,M) 

_  2cospP(V*M)  SClS(; 

0 
0 


0  0 

2mvJ{CYs  pv2sM2S  ^CYSrpv^M2S 


0 


0 


(A15) 


/( z)  = 


Iy~L 

~ jfqr 

L-Ix 

- jfPr 

IX~Iy 

-ifpy 


(A  16) 


/  (x,  M)  = 


'  ±-xPv]M2SbCm 

f pv]M2ScCm{a ) 

fpv2M2SbCn(P) 


(A17) 


/  (x,  M) 

^pu2M2SfcC, 


21, 

0 


0 


A;pu?M25cCm 
^M2^  ’  0 


27-  pv]M2SbCir 

0 

^-pv^SbC^ 


(A18) 


k  (x,  M) 


0 


±-xPv]M2SbCka  A. pV]M2SbCha 


0 


0 


A;pi;2M2ScCmje 

0  jr-pv2M2SbCn  jr  pv2sM2SbCn 


(A19) 
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Nonlinear  Singularly  Perturbed  Aircraft 
Systems 
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Abstract  The  problem  of  simultaneous  tracking  of  both  fast  and  slow  states  for  a 
general  class  of  nonlinear  singularly  perturbed  systems  is  addressed.  A  motivating 
example  is  an  aircraft  tracking  a  prescribed  fast  moving  target,  while  simultane¬ 
ously  regulating  speed  and/or  one  or  more  kinematic  angles.  Previous  results  in  the 
literature  have  focused  on  tracking  outputs  that  are  a  function  of  the  slow  states 
alone.  Moreover,  global  asymptotic  tracking  has  been  demonstrated  only  for  a  class 
of  nonlinear  systems  that  possess  a  unique  real  root  for  the  fast  states.  For  a  more 
general  class  of  nonlinear  systems  only  local  tracking  results  have  been  proven. 
In  this  paper,  control  laws  that  accomplish  global  tracking  of  both  fast  and  slow 
states  is  developed  using  geometric  singular  perturbation  methods.  Global  exponen¬ 
tial  stability  is  proven  using  the  composite  Lyapunov  function  approach  and  an  up¬ 
per  bound  necessary  condition  for  the  scalar  perturbation  parameter  is  derived.  Con¬ 
troller  performance  is  demonstrated  through  simulation  of  a  combined  longitudinal 
lateral/directional  maneuver  for  a  nonlinear,  coupled,  six  degree-of-freedom  model 
of  the  F/A-18A  Hornet.  Results  presented  in  the  paper  show  that  the  controller  ac¬ 
complishes  global  asymptotic  tracking  even  though  the  desired  reference  trajectory 
requires  the  aircraft  to  switch  between  linear  and  nonlinear  regimes.  Asymptotic 
tracking  while  keeping  all  the  closed-loop  signals  bounded  and  well  behaved  is  also 
demonstrated.  Additionally  the  controller  is  independent  of  the  scalar  perturbation 
parameter  nor  does  it  require  knowledge  of  it. 


1  Introduction 

This  paper  addresses  systems  that  possess  both  slow  and  fast  dynamics.  This  mul¬ 
tiple  time-scale  behaviour  is  either  a  system  characteristic  (for  example,  aircraft 
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and  flexible  beam  structures)  or  arises  due  to  implementation  of  a  fast  controller 
(for  example,  systems  with  fast  actuators  and/or  fast  sensors).  The  control  objective 
is  to  develop  a  stable  tracker  for  these  two  time-scale  systems  that  would  regulate 
both  slow  and  fast  states  simultaneously.  The  singular  perturbation  approach[13]  has 
been  the  foremost  technique  employed  in  the  literature  to  examine  the  behaviour  of 
these  two  time-scale  systems.  In  this  approach,  the  system  dynamics  are  approxi¬ 
mated  by  two  lower-order  subsystems.  The  slow  subsystem  captures  the  dominant 
phenomena  assuming  that  the  fast  variables  evolve  infinitely  many  times  faster,  and 
have  settled  down  onto  a  manifold.  The  fast  subsystem  addresses  the  neglected  phe¬ 
nomena,  and  assumes  that  the  slow  variables  remain  constant.  It  has  been  shown  that 
the  complete  system  behaviour  can  be  approximated  by  the  dynamics  of  the  slow 
subsystem  provided  the  fast  subsystem  is  uniformly  asymptotically  stable  about  the 
manifold  [6,  10].  These  results  of  singular  perturbation  methods  have  made  it  the 
most  favourable  model-reduction  technique  in  the  control  literature [14]. 

The  design  of  nonlinear  tracking  control  laws  for  the  slow  variables  using  sin¬ 
gular  perturbation  methods  has  received  a  lot  of  attention  in  the  past.  The  typical 
methodology  is  to  design  two  separate  controllers  for  each  of  the  two  subsystems, 
and  then  apply  their  composite  or  sum  to  the  full-order  system.  A  tracking  control 
law  is  designed  for  the  slow  subsystem  whereas  a  stabilizing  controller  is  designed 
for  the  fast  subsystem.  This  is  done  to  restrict  the  fast  variables  onto  a  manifold. 
Global  asymptotic  tracking  of  the  composite  control  structure  is  guaranteed  only  if 
the  manifold  is  unique.  This  manifold  is  the  set  of  fixed  points  of  the  fast  dynamics 
expressed  as  a  smooth  function  of  the  slow  variables  and  the  control  inputs;  hence 
it  is  not  always  unique.  To  enforce  the  uniqueness  condition,  previous  studies  in  the 
literature  have: 

1.  Assumed  that  the  system  has  a  unique  manifold[4,  8] 

2.  Considered  nonlinear  systems  that  have  a  unique  manifold.  This  is  satisfied  by 

two  time-scale  systems  that  are  nonlinear  in  the  slow  states  and  linear  in  the  fast 

states[ll] 

For  a  general  class  of  nonlinear  systems  such  as  aircraft,  approximate  approaches 
that  guarantee  local  stability  have  been  proposed.  One  approach  is  to  consider  the 
fast  variables  as  control  inputs  for  the  slow  subsystem.  Referenced  2]  used  this 
approach  to  design  nonlinear  flight  test  trajectories  for  velocity,  angle-of- attack, 
sideslip  angle  and  altitude  by  using  the  fast  angular  rates  as  the  control  variables. 
This  control  was  augmented  with  an  outer-loop  controller  that  determines  the  con¬ 
trol  surface  deflections  needed  to  ensure  that  the  angular  rates  track  the  output  of 
the  inner-loop.  More  recently  the  same  concept  has  been  employed  for  the  control 
of  generic  reentry  vehicles[7].  Another  approach  proposed  in  Reference[16]  consid¬ 
ered  the  general  class  of  nonlinear  singularly  perturbed  systems  and  computed  local 
approximations  of  the  manifold  that  helped  conclude  local  stability  for  the  complete 
system. 

All  of  the  approaches  discussed  above  demonstrate  slow  state  tracking  either  lo¬ 
cally  or  globally  by  restricting  the  fast  states,  and,  they  address  the  output  tracking 
problem  for  two  time-scale  systems  with  fast  actuators.  But  for  systems  whose  dy- 
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namics  inherently  possess  different  time-scales,  both  the  slow  and  the  fast  states 
constitute  the  output  vector.  For  example,  during  air  combat  maneuvering  an  air¬ 
craft  is  typically  required  to  track  a  fast  moving  target  while  regulating  speed  (slow 
variable)  and/or  one  or  more  kinematic  and  aerodynamic  angles.  In  this  case  the  fast 
states  cannot  be  restricted  to  simply  stabilize  onto  a  manifold.  The  reduced-order 
approach  therefore  appears  to  be  inadequate  for  a  general  class  of  output  track¬ 
ing  problem.  Reference[l]  formulated  optimal  control  laws  to  accomplish  fast  state 
tracking  using  invariant  measures  for  systems  with  oscillatory  fast  dynamics. 

In  this  paper,  state  feedback  control  laws  are  developed  for  a  general  class  of  non¬ 
linear  singularly  perturbed  systems  to  accomplish  slow  and  fast  state  tracking  simul¬ 
taneously.  The  paper  makes  two  major  contributions.  First,  the  approach  developed 
here  employs  the  reduced-order  technique  without  imposing  any  assumptions  about 
the  fast  manifold.  This  is  done  by  extending  the  previous  work  of  the  authors [16]  so 
as  to  not  require  computation  of  the  manifold.  Second,  global  exponential  tracking 
is  proved  using  the  composite  Lyapunov  approach[10].  From  the  stability  analysis 
it  is  shown  that  this  approach  applies  to  all  classes  of  singularly  perturbed  systems, 
with  the  global  exponential  stabilization  results  of  a  class  of  singularly  perturbed 
systems  being  a  special  case  [3].  Additionally,  the  technique  is  independent  of  the 
scalar  perturbation  parameter  and  an  upper  bound  on  this  parameter  is  derived  as  a 
necessary  condition  for  stability  results  to  hold.  These  results  are  demonstrated  by 
simulation  for  a  nonlinear,  coupled,  six  degree-of-freedom  model  of  the  F/A-18A 
Hornet. 

The  paper  is  organized  as  follows.  Section  2  mathematically  formulates  the  con¬ 
trol  problem  and  briefly  reviews  the  necessary  concepts  for  model  reduction  from 
geometric  singular  perturbation  theory.  Control  laws  and  the  main  results  of  the  pa¬ 
per  are  detailed  in  Section  3.  Section  4  presents  simulation  results,  and  conclusions 
are  presented  in  Section  5. 

2  Problem  Formulation  and  Model  Reduction 

The  following  nonlinear  singularly  perturbed  model  represents  the  class  of  two  time- 
scale  dynamical  systems  addressed  in  this  paper 


x  =  f(x,z)  +  g(x,z)u 
£Z  =  l(x,z)  +  k(x,z)u 


(1) 

(2) 


X 


(3) 


where  x  e  Mm  is  the  vector  of  slow  variables,  z  e  W1  is  the  vector  of  fast  variables, 
u  G  is  the  input  vector  and  y  G  Mm+n  is  the  output  vector.  £  £  M+  is  the  singu¬ 
lar  perturbation  parameter  that  satisfies  0  <  £  «  1.  The  vector  fields  f(.),g(.),l(.) 
and  k(.)  are  assumed  to  be  sufficiently  smooth  and  p  >  ( m  +  n ).  The  control  objec¬ 
tive  is  to  drive  the  output  so  as  to  track  sufficiently  smooth,  bounded,  time-varying 
trajectories,  such  that  x(t)  ^  xr(t)  and  z(t)  -A  zr(t)  as  t  -A  °o. 
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2.1  Reduced-order  Modeling 


The  singularly  perturbed  model  considered  in  Eqs.1,2  is  expressed  in  the  slow  time 
scale  t.  In  this  time-scale  the  slow  states  evolve  at  an  ordinary  rate  whereas  the  fast 
states  move  at  a  rate  of  O  (f ) .  This  system  can  be  equivalently  expressed  in  the  fast 
time-scale  T  such  that  the  fast  states  evolve  at  an  ordinary  rate  and  the  slow  variables 
move  slowly  at  a  rate  of  0(e) 

x'  =  £[f(x,z)  +  g(x,z)u]  (4) 

z!  =  l(x,  z)  +  k(x,  z)u  (5) 

where  '  represents  a  derivative  with  respect  to  T  =  tj=^-  and  to  is  the  initial  time.  Ge¬ 
ometric  singular  perturbation  theory  [6]  examines  the  behaviour  of  these  singularly 
perturbed  systems  by  studying  the  geometric  constructs  of  reduced-order  models 
obtained  by  substituting  £  =  0  in  Eqs.1,2  and  Eqs.4,5.  This  results  in  the  Reduced 
Slow  Subsystem 


X  =  f(x,z)+g(x,z)u  (6) 

0  =  l(x,  z)  +  k(x,  z)u  (7) 

and  the  Reduced  Fast  Subsystem 

x'  =  0  (8) 

z!  =  l(x,  z)  +  k(x,  z)u  (9) 


The  reduced  slow  subsystem  is  a  locally  flattened  space  of  the  complete  system 
(Eqs.1,2).  Since  the  vector  fields  were  assumed  to  be  sufficiently  smooth  there  ex¬ 
ists  a  smooth  diffeomorphism  that  maps  the  complete  system  onto  this  locally  flat¬ 
tened  space.  The  set  of  points  (x,  z,  u)  G  Mm  x  W1  x  Rp  is  a  smooth  manifold  Jfo  of 
dimension  m  +  p  that  satisfies  the  algebraic  Eq.7: 

^o*z  =  Zo(x,u)  (10) 

This  set  of  points  is  identically  the  fixed  points  of  the  reduced  fast  subsystem  (Eq.9). 
Thus  the  manifold  ^o  is  invariant.  The  flow  on  this  manifold  is  described  by  the 
differential  equation 


x  =  f(x,Z0(x,u))+g(x,Z0(x,u))u  (11) 

Fenichel[6]  proved  that  the  dynamics  of  a  singularly  perturbed  system  of  the  form 
represented  in  Eqs.1,2  is  constrained  within  0(e)  of  Eq.ll  if  the  reduced  fast  sub¬ 
system  is  stable  about  If  the  dynamics  of  Eq.ll  are  locally  asymptotically 
stable  about  the  manifold,  then  it  can  be  concluded  that  the  complete  system  is 
also  locally  asymptotically  stable.  Global  asymptotic  stability  conclusions  about  the 
complete  system  can  only  be  made  if  the  manifold  is  unique,  which  is  a  result  from 
differential  topology  and  center  manifold  theory  [6] . 
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3  Control  Formulation  and  Stability  Analysis 


The  central  idea  in  the  development  is  the  following.  If  the  manifold  is  unique  and 
an  asymptotically  stable  fixed  point  of  the  reduced  fast  subsystem,  the  complete 
system  follows  the  dynamics  of  the  reduced  slow  subsystem  globally.  Therefore,  for 
a  tracking  problem  addressed  in  this  paper  it  is  desired  that  this  manifold  lie  exactly 
on  the  desired  fast  state  reference  for  all  time.  This  condition  can  be  enforced  if  the 
nonlinear  algebraic  set  of  equations  is  augmented  with  a  controller  that  enforces 
the  reference  to  be  the  unique  manifold.  Additionally,  this  controller  should  also  be 
capable  of  simultaneously  driving  the  slow  states  to  their  specified  reference.  These 
ideas  are  mathematically  formulated  and  analyzed  in  the  following  subsections. 


3.1  Control  Law  Development 

The  objective  is  to  augment  the  two  time-scale  system  with  controllers  such  that 
the  system  follows  smooth,  bounded,  time- varying  trajectories  [xr(t),zr(t)]T .  The 
first  step  is  to  transform  the  problem  into  a  non-autonomous  stabilization  control 
problem.  Define  the  tracking  error  signals  as 

e(f)  =  x(t)-xr(t)  (12) 

£(f)  =  z{t)-zr(t)  (13) 

Substituting  Eqs.1,2,  the  tracking  error  dynamics  are  expressed  as 

e  =  f(x,  z)  +  g(x,  z)u  -  xr  =  F(e,  % ,  xr,  zr,xr)  +  G(e,  | ,  xn  zr)u  (14) 

et,  =  l(x,z)  +  k(x,z)u-£zr  =  L(e,^,xr,zr,ezr)  +  K(e,^,xr,zr)u  (15) 

The  control  law  is  formulated  using  the  reduced-order  models  for  the  complete  sta¬ 
bilization  problem,  which  is  obtained  using  the  procedure  developed  in  Section  2. 


Reduced  Slow  Subsystem 

e  =  F(e,^,xr,zr,xr)+G(e,^,xr,zr)u0  (16) 

0  =  L(e,^,xr,zr,0)+K(e,^,xr,zr)u0  (17) 

Reduced  Fast  Subsystem 

e'  =  0  (18) 

%  =  L(e,£,xr,zr,z^)+K(e,£,xr,zr)(u0+u/)  (19) 


It  is  known  that  the  fast  tracking  error  ^  will  settle  onto  the  manifold  that  is  a  func- 
tion  of  the  error  e  and  control  input  u,  which  may  not  necessarily  be  the  origin. 
To  steer  both  errors  to  the  origin,  the  control  input  must  be  designed  such  that  the 
origin  becomes  the  unique  manifold  of  the  reduced  slow  system  (Eqs.16,17).  There- 
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fore,  the  slow  controller  uo  is  designed  to  take  the  form 

G(e,§,xr,zr) 

_K(e,<*,xr,z  r) 

where  Ae  and  A^  specify  the  desired  closed-loop  characteristics.  With  this  choice  of 
slow  control,  the  reduced  fast  subsystem  becomes 


uo  =  - 


F(e,^,xr,zr,xr) 

L(e,£,xr,zr,0) 


Aee 


(20) 


e'  =  0  (21) 

=  L(e,^,xr,zr,z')-L(e,^,xr,zr,0)+A^+K(e,^,xr,zr)u/  (22) 

To  stabilize  the  fast  subsystem,  the  fast  control  u/  is  designed  as 

"G(e,£,xr,zr) 

_K(e,  £,xr,zr) 


0 

L(e,  § , xr,  zr,  0)  -  L(e,  % ,  xr,  zr,  zj.) 


(23) 


Thus,  the  composite  control  u  =  uq  +  u/  satisfies 


G(e,^,xr,zr) 

K(e,^,xr,zr) 


F(e,^,xr,zr,xr) 

L(e,£,xr,zr,z0 


assuming  that  the  rank  of 


G(.) 

K(.) 


>  (m  +  n). 


(24) 


The  complete  closed-loop  and  reduced  slow  subsystem  for  this  control  law  are  given 
as 


e  =  Aee  (25) 

(26) 

and 

e  =  Aee  (27) 

0  =  A^.  (28) 

respectively.  Observe  that  with  the  proposed  control  law  the  nonlinear  algebraic  set 
of  equations  (Eq.17)  have  been  transformed  to  a  linear  set  of  equations  (Eq.28). 
With  the  proper  choice  of  A  g ,  it  is  guaranteed  that  £  =  0  is  the  unique  manifold  for 
both  the  complete  and  the  reduced  slow  subsystems.  Furthermore,  this  manifold  is 
exponentially  stable  as  can  be  deduced  from  the  reduced  fast  subsystem 


e'  =  0 


(29) 

(30) 
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Remark  1  The  control  law  proposed  in  Eq.24  is  independent  of  the  perturbation 
parameter  £.  Furthermore  it  is  a  function  of  z'  that  implies  that  the 
reference  trajectory  chosen  for  the  fast  states  must  be  faster  when  com¬ 
pared  to  the  reference  of  the  slow  states.  Additionally,  as  for  all  singular 
perturbation  techniques  to  work  the  closed-loop  eigenvalues  Ae  and 
must  be  chosen  so  as  to  maintain  the  time-scale  separation. 


3.2  Stability  Analysis 


Complete  system  stability  is  analyzed  using  the  composite  Lyapunov  function 
approach[10].  Suppose  that  there  exist  positive  definite  Lyapunov  functions  V (t,  e)  = 
ere  and  W(t ,£)  =  for  the  reduced  subsystems,  with  continuous  first-order 
derivatives  satisfying  the  following  properties: 

1.  v(t,0)  =  0  and  7i||e||2  <  V(t,e)  <  72||e||2  \/t  £  M+,e  £  Mm,yi  =  72  =  1, 

2.  (VeV(/,e))rAee  <  -0£iere,  ai  =  2|Amin(Ae)|, 

3.  W(/,0)  =0  and  rsll^l2  <  <  r4||^||2  Vr  e  M+,^  =  74  =  1, 

4.  iy^W{t^))TA^  <  a2  =  2|Amin(^)|. 

Next,  consider  the  composite  Lyapunov  function  v(f,e,£)  :  M+  x  Mm  x  W1  -A  M+ 
defined  by  the  weighted  sum  of  V(t,e)  and  for  the  complete  closed-loop 

system 


v(r,e,§)  =  (l-d)V(t,e)+dW(t,£),  0<  d<  1  (31) 


The  derivative  of  v(f,e,  £)  along  the  closed-loop  trajectories  Eqs.25,26  is  given  by 


v  =  (1  -d)(yev)Tt+d(s/^w)Tt, 

v  =  {l-d)(VeV)TAee+^{V^W)TA^ 

v  <  -(1  - d)aieTe-  ^a2^r % 


e 

T 

O 

1 

e 

A. 

.  0  i«2_ 

A. 

(32) 

(33) 

(34) 

(35) 


Following  the  approach  proposed  in  Reference^],  add  and  subtract  2 av(f,e,  £)  to 
Eq.35  to  get 


v  <  - 


(l-d)a\ 

0 


e 

1 


+  2a(l  -  d)V  +  2  adW  -  2  av 


(36) 


where  a  >  0.  Substitute  in  Eq.36  for  the  Lyapunov  functions  V (t.e)  and  W(t,^)  to 
get 
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e 

T 

O 

1 

<N 

1 

1 

1 _ 

A 

0  ia2~  2  ad 

e 


-lav 


(37) 


If  £  satisfies 


£  <  £* 


02 

la 


(38) 


and  provided  a\  >  la,  then  from  the  definitions  of  02,  a,  and  d  it  can  be  concluded 
that  the  matrix  in  Eq.37  is  positive  definite.  Then  the  derivative  of  the  Lyapunov 
function  is  lower-bounded  by 


v  <  -2ov  (39) 

Since  the  composite  Lyapunov  function  lies  within  the  following  bounds 

(l-^)yi||e||2  +  ^y3||£H2<  V(f,e,§)  <  (l-^)r2||e||2+c?74||^||2  (40) 


or, 


7n 


e 

% 


2 

<  v(r,e,|)  <  722 


e 

% 


2 


(41) 


where  i  =  min  (( 1  —  d)  ,  dy^ )  and  722  =  min  ( ( 1  —  d)  72 ,  dyf) ,  the  derivative  of  the 
Lyapunov  function  can  be  expressed  as 


v  <  -20711 


e 

% 


2 


(42) 


Lrom  the  definition  of  the  constants  71 1,  722,  and  o,  and  invoking  Lyapunov’s  Di¬ 
rect  Method[9],  uniform  exponential  stability  in  the  large  of  (e  =  0,  £  =  0)  can  be 
concluded.  Lurthermore,  since  the  reference  trajectory  xr(t)  and  z r(t)  is  bounded,  it 
is  concluded  that  the  states  x(t)  —>  xr(t)  and  z(t)  -»  zr(t)  as  t  0°.  Since  the  matrix 
"  G(  )  1 

is  restricted  to  be  full  rank,  examining  the  expression  for  u  in  Eq.24  it  is 

_K(-J  J 

concluded  that  u  G  Loo. 


Remark  2  Recall  that  for  the  special  case  of  state  regulation  the  system  dynamics 
in  Eqs.14,15  become  autonomous.  In  such  a  case,  the  result  of  global 
exponential  stability  is  obtained  with  less-restrictive  conditions  on  the 
Lyapunov  functions  V(e),  W(^)9  and  consequently  v(e,^).  Similar 
conclusions  were  made  in  Reference^]  for  the  stabilization  problem 
of  a  special  class  of  singularly  perturbed  systems  where  the  control 
affects  only  the  fast  states.  Note  that  for  the  special  class  of  systems 
considered  in  Reference  [3],  the  non-diagonal  elements  of  the  matrix  in 
Eq.37  are  nonzero,  and  the  bound  on  the  parameter  £  is  slightly  differ¬ 
ent. 
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Remark  3  From  Eq.37,  a  conservative  upper  bound  for  a  is  a  <  (j-,  and  conse¬ 
quently  £*  «  Therefore,  qualitatively  this  upper  bound  is  indirectly 
dependent  upon  the  choice  of  the  closed-loop  eigenvalues. 


4  Numerical  Simulation 

The  purpose  of  the  example  is  to  demonstrate  the  methodology  and  controller  per¬ 
formance  for  an  under-actuated,  nonlinear,  singularly  perturbed  system.  The  system 
studied  is  a  nonlinear,  coupled,  six  degree-of-freedom  F/A-18A  Hornet  aircraft[5]. 
The  combined  longitudinal-lateral/directional  maneuver  requires  tracking  of  the  fast 
variables,  in  this  case  body-axis  pitch  and  roll  rates,  while  maintaining  zero  sideslip 
angle.  Closed-loop  characteristics  such  as  stability,  accuracy,  speed  of  response  and 
robustness  are  qualitatively  analyzed.  The  maneuver  consists  of  an  aggressive  verti¬ 
cal  climb  with  a  pitch  rate  of  25  deg/sec,  followed  by  a  roll  at  a  rate  of  50  deg/sec, 
while  maintaining  zero  sideslip  angle.  The  Mach  number  and  angle-of-attack  are 
assumed  to  be  input- to- state  stable.  The  initial  conditions  are  a  Mach  number  of 
0.4  at  15,000  feet,  an  angle-of-attack  of  10  deg,  and  elevon  angle  of  —11.85  deg. 
All  other  states  are  zero.  The  F/A-18A  Hornet  model  is  expressed  in  stability  axes. 
Since  it  is  difficult  to  cast  the  nonlinear  aircraft  model  into  the  singular  perturbation 
form  of  Eq.1-2  ,  the  perturbation  parameter  £  is  introduced  in  front  of  those  state 
variables  that  have  the  fastest  dynamics.  This  is  done  so  that  the  results  obtained  for 
£  =  0  will  closely  approximate  the  complete  system  behaviour  (with  £  =  1).  This  is 
called  the  forced  perturbation  technique,  and  is  commonly  used  in  the  aircraft  liter¬ 
ature  [2,  12].  Motivated  by  experience  and  previous  results,  the  six  slow  states  are 
Mach  number  M,  angle-of-attack  a,  sideslip  angle  j 3  and  the  three  kinematic  states: 
bank  angle  (j),  pitch-attitude  angle  6 ,  and  heading  angle  y/.  The  three  body-axis  an¬ 
gular  rates  (p,q,r)  constitute  the  fast  states.  The  control  variables  for  this  model  are 
elevon  Se,  aileron  Sa,  and  rudder  Sr  and  are  assumed  to  have  sufficiently  fast  enough 
actuator  dynamics.  The  convention  used  is  that  a  positive  deflection  generates  a  neg¬ 
ative  moment.  The  throttle  rj  is  maintained  constant  at  80%,  because  slow  engine 
dynamics  require  introduction  of  an  additional  time-scale  in  the  analysis;  this  is  a 
consideration  which  is  beyond  the  scope  of  this  paper.  The  aerodynamic  stability 
and  control  derivatives  are  represented  as  nonlinear  analytical  functions  of  aerody¬ 
namic  angles  and  control  surface  deflections.  Quaternions  are  used  to  represent  the 
kinematic  relationships  from  which  the  Euler  angles  are  extracted.  The  details  of 
these  relationships  are  discussed  in  Reference[15]. 


Results  and  Discussion 

Simulation  results  in  Figures  1-6  show  that  all  controlled  states  closely  track  their 
references.  At  two  seconds  the  aircraft  is  commanded  to  perform  a  vertical  climb, 
and  after  eight  seconds  the  pitch  rate  command  changes  direction  and  Mach  num- 
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ber  drops.  The  lateral/directional  states  and  controls  are  identically  zero  until  the 
roll  command  is  introduced  at  time  equals  15  seconds.  Observe  that  all  of  the  states 
asymptotically  track  the  reference.  Figure  2  shows  that  the  elevon  deflection  re¬ 
mains  within  specified  limits [5]  throughout  the  vertical  climb,  and  the  commanded 
roll  produces  a  sideslip  angle  which  is  negated  by  application  of  the  rudder.  The 
aileron  and  the  rudder  deflections  remain  within  bounds  while  the  aircraft  rolls  and 
comes  back  to  level  flight.  The  maximum  pitch- attitude  angle  is  81  deg,  maximum 
bank  angle  is  81  deg  (Figure  4),  and  the  maximum  sideslip  error  is  ±4deg.  The 
quaternions  and  the  complete  trajectory  are  shown  in  Figures  5  and  6  respectively. 
From  Figure  6,  note  that  after  completing  the  combined  climb  and  roll  maneuver, 
the  aircraft  is  commanded  to  remain  at  zero  sideslip  angle,  roll  rate,  and  pitch  rate. 
It  then  enters  a  steady  dive  with  all  other  aircraft  states  bounded.  The  controller  re¬ 
sponse  is  judged  to  be  essentially  independent  of  the  reference  trajectory  designed. 
The  robustness  properties  of  the  controller  are  quantified  by  the  upper  bound  £*.  For 
this  example,  the  design  variables  are  d  =  0.5,  oq  =  10,  a  =  2,  and  oq  =  15,  so  the 
upper  bound  becomes  £*  =  7.5.  Therefore  for  all  £  <  £*  global  asymptotic  tracking 
is  guaranteed  and  in  this  case  £  =  1 . 
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Fig.  1  Body- Axis  Angular  Rates 
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Fig.  2  Control  Surface  Deflections 
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Fig.  3  Mach  Number  and  Angle-of-Attack 


Fig.  4  Sideslip  Angle  and  Kinematic  Angles 
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Fig.  5  Quaternion  Parameters 


Fig.  6  Three-Dimensional  Trajectory 


5  Conclusions 

A  control  law  for  global  asymptotic  tracking  of  both  the  slow  and  the  fast  states  for  a 
general  class  of  nonlinear  singularly  perturbed  systems  was  developed.  A  composite 
control  approach  was  adopted  to  satisfy  two  objectives.  First,  it  enforces  the  speci¬ 
fied  reference  for  the  fast  states  to  be  ‘the  unique  manifold’  of  the  fast  dynamics  for 
all  time.  Second,  it  ensures  that  the  slow  states  are  tracked  simultaneously  as  desired. 
Stability  of  the  closed-loop  signals  was  analyzed  using  the  composite  Lyapunov  ap¬ 
proach,  and  controller  performance  was  demonstrated  through  numerical  simulation 
of  a  nonlinear,  coupled,  six  degree-of-freedom  model  of  an  F/A-18A  Hornet.  The 
control  laws  were  implemented  without  making  any  assumptions  about  the  nonlin¬ 
earity  of  the  six  degree-of-freedom  aircraft  model.  Based  on  the  results  presented 
in  the  paper,  the  following  conclusions  are  drawn.  First,  both  positive  and  nega¬ 
tive  angular  rate  commands  were  seen  to  be  perfectly  tracked  by  the  controller  and 
consistent  tracking  was  guaranteed  independent  of  the  desired  reference  trajectory. 
Second,  throughout  the  maneuver  the  controller  demonstrated  global  asymptotic 
tracking  even  though  the  desired  reference  trajectory  requires  the  aircraft  to  switch 
between  linear  and  nonlinear  regimes.  This  robust  performance  of  the  controller  was 
shown  to  hold  for  all  £  <  £*  =  7.5.  Third,  all  closed-loop  signals  were  bounded  and 
the  control  surface  deflections  computed  were  smooth  and  within  specified  limits. 
Fourth,  this  technique  does  not  require  the  knowledge  of  the  perturbation  parameter 
£.  This  is  an  important  consideration  for  systems  such  as  aircraft,  where  quantifying 
this  parameter  can  be  difficult. 
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Output  Tracking  of  Non-Minimum  Phase  Dynamics 

Anshu  Siddarth*  and  John  ValaseP 

Texas  A  &  M  University,  College  Station,  TX  778/3-31/1. 

This  paper  develops  a  general  control  algorithm  for  the  exact  output  tracking  of  non¬ 
linear  systems  with  non-minimum  phase  dynamics.  The  control  technique  is  causal  and 
does  not  require  preview  or  knowledge  of  the  desired  reference  beforehand.  Additionally, 
the  control  is  independent  of  the  operating  condition  and  the  desired  reference.  The  main 
idea  of  the  paper  is  to  convert  the  output  tracking  problem  into  a  slow  state  tracking 
problem  for  singularly  perturbed  systems.  Previous  work  on  singularly  perturbed  systems 
have  shown  asymptotic  tracking  of  slow  states  only  for  a  class  of  nonlinear  systems  that 
are  linear  in  the  fast  states.  However,  this  paper  develops  a  control  technique  that  does 
not  have  this  restriction  and  is  applicable  to  a  general  class  of  nonlinear  singularly  per¬ 
turbed  systems.  The  procedure  is  to  compute  the  desired  internal  state  trajectory  and  the 
control  scheme  that  stabilizes  the  nonlinear  system  online,  thereby  guaranteeing  asymp¬ 
totic  output  tracking.  Performance  of  this  approach  is  demonstrated  in  simulation  for  two 
benchmark  problems:  the  beam-ball  example  that  is  slightly  non-minimum  phase  and  fails 
to  have  a  well-defined  relative  degree,  and  the  Conventional  Take-off  and  Landing  (CTOL) 
non-minimum  phase  aircraft.  Results  presented  in  the  paper  show  that  the  approach  is 
able  to  accomplish  perfect  tracking  while  stabilizing  the  closed-loop  system,  while  keeping 
all  closed-loop  signals  bounded. 


Nomenclature 


n 

V 

r 

t 

u 

x 

y 


order  of  the  nonlinear  system 

number  of  control  variables  and  number  of  outputs 

relative- degree  of  the  output 

slow  time 

control  vector 

state  vector 

output  vector 


Greek 

(£>»?) 


v{t) 


T 


normal  coordinates  of  the  system 
singular  perturbation  parameter 
desired  output  dynamics 
fast  time 


Subscripts 

d 

0 


reference  trajectory 
reference  quantity 


Symbols 

time  derivative  with  respect  to  slow  time 
'  time  derivate  with  respect  to  fast  time 

OQ  order  symbol 
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I.  Introduction 


Output  tracking  control  structures  have  received  considerable  attention  in  the  literature.  Nonlinear 
control  techniques  such  as  Feedback  Linearization  and  Sliding  Mode  Control  are  able  to  guarantee  closed- 
loop  stability  and  precise  output  tracking,  but  only  for  a  specific  class  of  nonlinear  systems  that  are  minimum 
phase.  Additionally,  these  control  approaches  require  that  the  output  have  a  well-defined  relative  degree. 
But  there  are  a  number  of  important  flight  control  problems  such  as  acceleration  control  of  tail-controlled 
missiles,2  control  of  planar  Vertical  Take-off  and  Landing  (V/STOL)  aircraft,  and  Conventional  Take-off 
and  Landing  (CTOL)  aircraft  that  are  characterized  by  unstable  zero  dynamics,  thereby  not  satisfying  the 
conditions  cited  above.  These  restrictions  and  the  need  to  develop  stabilizing  trackers  have  paved  the  way 
for  control  algorithms  that  are  applicable  to  a  more  general  class  of  nonlinear  systems. 

The  technique  presented  by  Benvenuti  et.al  modified  the  output  of  a  corresponding  linear  system  so  that 
it  does  not  contain  right  half-plane  zeros.  A  similar  technique  was  employed  by  Hedrick  and  Gopalswamy6 
to  track  pilot  g  commands  while  satisfying  flying  quality  specifications.  These  approaches  were  able  to 
guarantee  ‘local’  tracking  that  is  specific  to  the  desired  flight  condition  and  reference  trajectory.  Another 
approximate  approach  proposed  by  Doyle  et.al  takes  a  sufficient  number  of  derivatives  of  the  output  such 
that  the  control  and  its  higher-order  derivatives  appear  in  the  equation.  The  paper  proposed  to  modify  the 
sign  of  some  of  the  control  derivatives  in  order  to  render  the  modified  output  dynamics  minimum  phase.  It 
was  shown  that  these  modified  output  dynamics  closely  approximate  the  actual  dynamics  of  the  system.  In 
contrast  to  the  former,  Shklnikov  and  Shtessel8  modified  the  sliding  surface  to  ensure  that  the  right  half-plane 
zero  is  canceled  out.  The  system  was  required  to  be  in  normal  form  with  bounded  nonlinearities,  and  the 
technique  was  demonstrated  for  an  F-16  aircraft.  Considering  the  local  nature  of  these  works,  Zhu  et.al 
proposed  a  controller  which  separates  the  internal  dynamics  into  linear  and  nonlinear  parts.  The  linear  part 
is  stabilized  by  linear  state  feedback,  whereas  the  nonlinear  part  is  stabilized  only  if  the  system  strays  away 
from  the  trajectory.  In  an  effort  to  control  the  V/STOL  slightly  non-minimum  phase  aircraft,  Hauser  et.al 
neglected  some  terms  that  are  the  cause  of  this  unstable  behaviour,  and  proved  that  a  stable  controller  can 
be  designed  using  this  approximate  technique. 

Another  class  of  the  literature  takes  advantage  of  the  multiple  time-scale  behaviour  of  air  vehicles.  Lee 
and  Ha  designed  an  autopilot  for  a  Skid-To-Turn  (STT)  missile  by  splitting  the  dynamics  into  slow  and 
fast  components.  The  slow  subsystem  was  composed  of  the  zero  dynamics  and  was  indirectly  controlled  by 
the  controllable  fast  subsystem.  A  similar  approach  was  proposed  by  Lee  and  Ha  wherein  the  normal  form 
of  a  nonlinear  I/O  feedback  linearizable  system  was  transformed  to  a  two  time-scale  system  by  a  change  of 
coordinates.  But  in  this  case  the  fast  subsystem  constituted  the  zero  dynamics,  and  a  modified  composite 
control  scheme  was  employed  to  stabilize  the  complete  system. 

In  addition  to  the  approximate  schemes  described  above,  low  gain  feedback  approaches  have  been  pro¬ 
posed  in  the  literature  for  nonlinear  systems  with  the  upper  triangular  form.  ’  ’  1  The  exact  output  track¬ 

ing  approach  proposed  by  Devasia  et.al  employed  a  combination  of  feed-forward  and  feedback  control. 
The  feed- forward  control  was  found  using  inversion,  given  a  desired  output  trajectory  and  its  higher-order 
derivatives.  This  inversion  is  non-causal  and  requires  the  infinite  time  preview  of  the  complete  output  tra¬ 
jectory.  It  is  computed  offline,  and  the  inversion  computes  the  desired  input-state  trajectory  that  would 
lead  to  asymptotic  output  tracking.  The  linear  feedback  control  is  employed  to  locally  stabilize  the  internal 
dynamics.  This  approach  was  extended  to  require  a  finite  time  preview  of  the  output  and  was  applied  to  the 
benchmark  VTOL  landing  example.16 

Summarizing  these  previous  results,  internal-state  feedback  is  necessary  to  stabilize  a  non-minimum  phase 
system.  Moreover,  exact  output  tracking  is  achieved  when  the  desired  internal  state  trajectory  is  tracked. 
Motivated  by  this  fact,  this  paper  develops  an  exact  output  tracking  control  technique  for  non- minimum 
systems  using  singular  perturbation  methods.  The  paper  makes  three  major  contributions.  First,  the  output 
dynamics  are  not  required  to  have  a  well-defined  relative  degree  with  respect  to  the  input.  The  idea  is  to 
take  a  sufficient  number  of  derivatives  of  the  output  and  cast  the  system  in  a  singularly  perturbed  form.  This 
procedure  forces  the  internal  states  of  the  system  to  behave  as  the  fast  variables.  It  also  allows  the  internal 
states  to  be  used  as  ‘pseudo-control  variables’  for  output  tracking.  A  sequential  procedure  is  developed 
to  compute  the  internal  states  that  ensure  asymptotic  output  tracking  and  the  controller  is  designed  to 
force  the  internal  states  to  follow  the  computed  trajectory.  The  second  contribution  is  a  full-state  feedback 
controller  that  is  designed  online,  and  is  independent  of  any  particular  operating  condition  and  desired 
output  trajectory.  Third,  the  controller  so  designed  is  causal  and  does  not  require  any  knowledge  or  preview 
of  the  output  trajectory  beforehand. 
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The  paper  is  organized  as  follows.  Section  II  describes  the  class  of  systems  considered  and  formulates  the 
control  problem.  Section  III  develops  the  nonlinear  control  design  and  analyses  stability  of  the  closed-loop 
system.  In  Section  IV  the  methodology  is  illustrated  with  application  to  two  benchmark  problems:  the 
beam-ball  example  that  is  slightly  non-minimum  phase  and  does  not  have  a  well-defined  relative  degree,  and 
the  classic  CTOL  aircraft  non-minimum  phase  problem.  Conclusions  are  presented  in  Section  V. 

II.  Problem  Statement 

The  dynamical  system  considered  is  the  nonlinear  affine  in  control  dynamical  system  expressed  as 

x{t)  =  f(x(t))  +  g(x(t))u(i)  (1) 

y(t)  =  h(x(i))  (2) 

where  x(t)  E  Mn  is  the  vector  of  state  variables,  y(t)  E  W  is  the  output  vector,  and  u (t)  E  W  is  the  vector 
of  control  variables.  The  vector  fields  f(.),  g(.),  and  h(.)  are  sufficiently  smooth.  The  control  objective  is  to 
ensure  that  the  output  asymptotically  tracks  a  sufficiently  smooth,  time-varying,  bounded  trajectory,  such 
that  y (t)  -A  yd(t)  as  t  -A  oc.  It  is  assumed  that  the  nonlinear  system  considered  satisfies  the  following 
assumptions: 

Assumption  1:  The  system  described  by  Eqs.1-2  is  non- minimum  phase. 

Assumption  2:  The  output  of  the  nonlinear  dynamical  system  considered  does  not  have  a  well-defined 
relative  degree  with  respect  to  the  control  variables. 

Assumption  3:  The  output  dynamics  are  differentially  flat  or  the  number  of  control  inputs  available  is 
equal  to  the  number  of  output  variables  to  be  controlled. 

Assumption  4:  The  desired  output  trajectory  and  its  higher-order  derivatives  are  bounded. 

Henceforth  the  time-dependency  notation  is  dropped  for  convenience. 

III.  Tracking  Control  Development 

The  system  dynamics  Eqs.1-2  are  expanded  and  written  in  the  form 


X 

=  f(x)  +  g(x)u 

(3a) 

yi 

=  fti(x) 

V2 

=  h2(x) 

Up 

=  Mx) 

(3b) 

Let  the  relative  degree  of  the  outputs  (2/1 , 2/2 9  •  -  -  ?  2/p )  be  (rq,  rq,  •  •  • ,  rp)  respectively,  and  let  r  =  rq  +7*2  +  . . .+ 
rp.  Note  that  in  this  context  the  relative- degree  is  defined  as  the  number  of  derivatives  of  the  output  required 
such  that  the  control  appears  linearly.  The  control  influence  may  be  singular.  The  system  is  cast  into  normal 
form  using  the  procedure  shown  in  Reference  1.  Define  =  y\~x  for  i  =  1, . . .  ,p  and  j  =  1, . . .  ,7q  and 
denote 

t(t)  =  (yi,y[1\---,y[\y2,---,y22,---,ypp)T 

=  &P)T  (4) 
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Let  rj  e  R"  r  denote  the  set  of  internal  states  such  that  (£,  rf)  form  a  new  set  of  coordinates  for  the  nth  order 
system  Eqs.3.  In  these  new  coordinates  the  system  dynamics  become 


c  =  & 

Cri-i  =  Ct  (5a) 

Ci  =  fyCCv)  +  QyCCv)  u 

V  =  ty(Cv)+Sv(Cv)^  (5b) 


for  i  =  1, . . . , p.  It  has  been  shown  in  Reference.il  that  the  normal  form  of  Eqs.5  can  be  cast  in  the 
singularly  perturbed  form.  This  procedure  shows  that  the  internal  dynamics  constitute  the  fast  subsystem. 
The  singular  perturbation  parameter  e  is  introduced  in  the  system  of  Eqs.  5  to  emphasize  that  the  internal 
states  evolve  faster  than  the  other  states. 


a  =  c 


>-1  =  Ci 

(6a) 

Ci  =  fy,(Cv)  +  9yCCv)  u 
=  fy(Cv)+grl(C'n)u 

(6b) 

It  is  desired  that  the  slow  states  £(£)  follow  the  desired  output  trajectory  while  the  fast  states  rj  remain 
bounded  for  all  time.  This  problem  has  been  studied  in  the  literature  as  a  control  problem  of  asymptotic 
tracking  of  the  slow  states  for  a  special  class  of  systems  in  which  the  fast  dynamics  are  linear  in  the  fast 
states.  In  the  present  work  a  control  algorithm  that  leads  to  global  asymptotic  tracking  is  developed  and 
demonstrated. 


A.  Control  Design 

In  geometric  singular  perturbation  theory,1  the  behaviour  of  singularly  perturbed  systems  is  determined 
using  geometric  constructs  of  the  reduced-order  models,  which  are  obtained  by  substituting  e  =  0  in  Eqs. 6. 
This  results  in  two  subsystems 

Reduced  Slow  Subsystem : 

a  =  c 

e,- ,  =  e,  (7a) 

Ci  =  fvi(Cw)+ffvi(Cv)u 

0  =  iv(Cv)  +gr,(Cv)u  (7b) 

Reduced  Fast  Subsystem : 

ci  =  o 

Cli-i  =  0  (8a) 

C  =  o 

l'  =  fv(Cv)  +  Ev(Cv)u  (8b) 


where  '  represents  derivatives  with  respect  to  the  fast  time  scale:  t  =  The  dynamics  of  the  resulting 
reduced  slow  subsystem  is  restricted  to  r  dimensions  and  constrained  to  lie  upon  an  Mo:  n  —  r  dimensional 
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smooth  surface  defined  by  the  nonlinear  algebraic  set  of  equations  Eq.7b.  This  surface  is  identically  the  fixed 
points  of  the  reduced  fast  subsystem  Eq.8b.  If  the  the  reduced  fast  subsystem  of  Eqs.8  is  stable  about  this 
smooth  surface,  then  conclusions  about  the  stability  of  the  complete  system  Eqs.6  can  be  made  by  studying 
the  reduced  slow  system  Eqs.7. 

Note  that  the  smooth  surface  cannot  be  analytically  computed,  and  in  addition  there  maybe  several 
such  surfaces  that  satisfy  the  algebraic  set  of  equations.  In  order  to  obtain  the  unique  surface  that  the  fast 
variables  must  be  stable  about  it  is  assumed  that  the  fast  states  are  the  pseudo  control  variables  for  the 
reduced  slow  subsystem.  The  control  variables  may  then  be  computed  by  ensuring  that  the  fast  states  follow 
the  computed  surface.  For  convenience  rewrite  the  complete  system  Eqs.6  in  compact  form 


r  = 

fy(£A  +  gy(£,f?)u 

(9a) 

efi  = 

f*j(£,»?)  +  g*7(£,»7)u 

(9b) 

with  the  reduced  slow  subsystem  written  as 

r  = 

fy(£A  +Sy(£Au 

(10a) 

0  = 

fr,(€,r])  +gr,(£,V)u 

(10b) 

Let  (yd?  y^,  •  •  • ,  Yd)T  denote  the  vector  of  the  desired  output  trajectory  and  its  r  order  derivatives.  To 
ensure  that  y d  is  an  asymptotically  stable  equilibrium  of  the  reduced  slow  system,  define  a  positive-definite 
and  decrescent  Lyapunov  function  that  satisfies  the  following: 


Condition  1.  V(t,£  —  yd)  :  [0,  oo)  x  Dy  -A  M  is  continuously  differentiable  and  Dy  G  Mr  contains  the 
origin,  such  that  0  <  —  VdW)  <  V(t,  £  —  yd)  <  *02  (|  |£  —  Vd\ |)  for  some  class  JC  functions  Vh(-)  and  V^O)- 

Design  a  manifold  y  =  ?7d(£,yd,  u)  such  that  the  closed-loop  reduced  slow  system  Eq.lOa  satisfies 

Condition  2. 

dV  dV  9 

-Qt+~d(;  +Sy^,Vd)u]  <  -aif/>3(£-yd),  «i  >  0 

where  ^(O  is  a  continuous  positive-definite  scalar  function  that  satisfies  03(0)  =  0. 


Conditions  1-2  complete  the  design  of  the  controller  for  the  reduced  slow  subsystem.  Note  that  the 
manifold  Mo  :  y d  computed  above  is  a  function  of  the  control  u,  which  is  unknown.  It  is  known  that 
the  complete  system  will  have  the  properties  of  the  reduced  slow  subsystem  if  the  fast  state  asymptotically 
stabilizes  about  the  fast  state  trajectory  y d-  This  condition  is  enforced  by  designing  the  control  u.  Define 
the  error  ev  =  y  —  yd  and  rewrite  Eq.9b  as 

ev' =  v)  +  §»?(£>  ^)u  (11) 

Define  a  positive-definite  and  decrescent  Lyapunov  function  that  satisfies 

Condition  4.  W(t,  £  —  yd,  e^)  :  [0,  00)  x  Dy  x  Dv  -A  M  is  continuously  differentiable  and  D v  C  Mn_r 
contains  the  origin,  such  that 


0  <  •MIMI)  <  W{t^~  yd,e^)  <  02(||e^||) 
for  some  class  /C  functions  0i(.)  and  02(0- 

and  design  u  such  that  the  closed- loop  reduced  fast  system  Eq.  11  satisfies 

Condition  5. 

[t)(£>  v)  +  g»}(£>  ^)U]  ^  ~a3<t>3(ev)’  «3  >  0 

where  03  0)  is  a  continuous  positive-definite  scalar  function  that  satisfies  <£3(0)  =  0. 
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B.  Stability  Analysis 

The  following  theorem  summarizes  the  main  result  of  the  paper. 

Theorem  1:  Suppose  the  control  u  is  designed  according  to  the  Conditions  1  —  6  and  the  nonlinear  system 
Eqs.1-2  satisfies  Assumptions  1-4.  Then  for  all  initial  conditions  (£  —  yd,  e^)  £  Dy  xDv  the  control  uniformly 
asymptotically  stabilizes  the  non-minimum  phase  system  and  equivalently  drives  the  output  y  (t)  -A  y d{t) 
for  all  positive  constants  e  <  e*,  where  e*  satisfies  the  inequality  Eq.17. 


Proof.  The  closed-loop  complete  system  in  the  error  coordinates  (e  =  £  —  yd)  is  given  as 


er  =  F(e,er?  +%,y<i,u) 

(12a) 

eev  =  G(e,ev  +  r)d,yd,u)  -  er)d 

(12b) 

Rearrange  the  closed-loop  system 

to  form 

er 

=  F(e,7?d,yd,u) 

+  [F(e,e^  +  %,yd,u)-F(e,7?d,yd,u)] 

(13a) 

eev 

=  G(e,ev+r]d,yd,Su)  -  erid 

(13b) 

Next,  closed- loop  system  stability  of  the  states  is  analyzed  using  the  composite  Lyapunov  function  ap¬ 
proach.  Consider  a  Lyapunov  function  candidate  for  the  complete  closed-loop  system 

v(t,  e,  e^)  =  V(t,  e)  +  W (t,  e,  e^)  (14) 

From  the  properties  of  V  and  W  it  follows  that  v(t,  e,  e^)  is  positive-definite  and  decrescent.  The  derivative 
of  v  along  the  trajectories  of  Eq.13  is  given  by 


+ 


dV  dV  r 
dt  +  de 6 
dW  dW 


ldW 


dt  de  6  e  de„  Br] 


(15) 


Suppose  that  Lyapunov  functions  V  and  W  also  satisfy  both  conditions 


Condition  5. 


dV 

—  [F(e,  ev  +  rjd ,  yd,  u)  -  F(e,  r]d,yd,  u)]  <  Pi ^(e)^^);  pi  >  0 


Condition  6. 

dW  r  dW 

dt  +  de 


dW  dr] d 
de v  de 


er  <  7103^)  +  yS2'03(e)03(e77); 71  >  0,/32  >  0 


Using  Conditions  1-6  Eq.15  now  becomes 

z>  =  -oq03(e)2  +  Pi 03 (e)03(e^)  -  ^03(e^)2  +  7i03(e??)  +  p2^s(e)<l>3(erj) 

Rearrange  (16)  to  get 


v  < 


where  T  = 


03 

03 


and 


ol  1  -|[di+d2] 

-\Hh  +d2]  ^-71 


(16) 

(17) 


and  A  is  positive-definite  for  e  <  e*.  By  definition  of  the  continuous  scalar  functions  03  and  03  it  follows 
that  z>  is  negative  definite.  Using  the  composite  Lyapunov  approach  it  is  concluded  that  y (t)  -A  yd(t) 
asymptotically.  Since  the  desired  trajectory  is  assumed  to  be  smooth  and  bounded  with  bounded  first-order 
derivatives,  the  control  commands  u  remain  bounded  for  all  time.  □ 
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IV.  Numerical  Examples 


A.  Purpose  and  Scope 

The  preceding  theoretical  developments  are  demonstrated  with  simulation  for  two  benchmark  problems.  The 
first  example  is  the  beam-ball  example  which  fails  to  have  a  well-defined  relative  degree.  The  objective  is  to 
ensure  that  the  ball  remains  in  contact  with  the  beam  and  tracks  any  trajectory  from  a  class  of  admissible 
trajectories.  A  step-by-step  procedure  of  controller  development  is  detailed  for  the  system  and  the  closed- loop 
results  are  studied  for  a  time- varying  trajectory.  The  second  example  develops  control  laws  for  a  nonlinear 
CTOL  aircraft  problem.  The  objective  of  this  example  is  to  test  the  performance  of  the  controller  for  a 
nonlinear,  non-minimum  phase  benchmark  problem. 


B.  Low-Order  Nonlinear  System  Tracking:  The  Beam-Ball  Example 

The  setup  consists  of  a  beam  that  can  only  rotate  in  a  vertical  plane  by  applying  a  torque  at  the  center  of 
the  beam,  and  a  ball  that  is  free  to  roll  along  the  beam.  It  is  desired  that  the  ball  always  remains  in  contact 
with  the  beam  and  that  the  rolling  occurs  without  slipping.  The  goal  is  to  track  any  trajectory  from  a  class 
of  admissible  trajectories.  The  dynamical  system  is  taken  from  Reference  3: 


X\ 

X2 

X3 

X4 

X2 

"  0  " 

B(x  1X4  —  Gsinx  3) 

+ 

0 

X4 

0 

0 

_  1  _ 

(18) 


y  =  x  i  (19) 

where  x  =  {x\,  X2,  X3,  x^)T  =  ( r,r,9,9 ),  y  =  x  1  where  r  is  the  distance  of  the  ball  from  the  center  of  the 
beam,  and  0  is  the  roll  angle  of  the  beam.  The  constants  M  and  J &  are  the  mass  and  moment  of  inertia  of 
the  ball,  J  is  the  moment  of  inertia  of  the  beam,  R  is  the  radius  of  the  ball,  G  is  the  acceleration  due  to 
gravity,  and  B  is  defined  as  M^J^/R2  +  M).  The  torque  of  the  system  is  related  to  the  control  u  by 

r  =  (Mr2  +  J  +  Jb)u  +  2 MrrO  +  MGr  cos  0  (20) 

The  system  output  is  required  to  track  a  desired  trajectory  yd(t)  asymptotically. 


1.  Control  Design 

Following  the  procedure  detailed  in  Section  III  the  system  Eq.18  is  cast  in  the  normal  form 


y  =  x  1  (21a) 

V  =  x  2  (21b) 

y  =  B(x  ix2  —  Gsinx  3)  (21c) 

y  —  B(x\x  2  —  Gx^cosxz  +  2x1X4  u)  (21d) 

±4  =  u  (21e) 


Eqs.21  are  clearly  non-minimum  phase  since  a  feedback  linear izable  control  cannot  stabilize  the  internal 
state  X4.  Additionally,  if  the  angular  velocity  of  the  beam  is  zero  and/or  the  ball  is  at  the  center  of  the 
beam,  the  control  influence  in  Eq.21d  is  zero. 

The  system  is  nondimensionalized  to  determine  whether  or  not  it  exhibits  multiple  time-scale  behaviour. 
Let  (to,  xiq ,  £20,  x3Cb  ^40,  ^0)  indicate  the  reference  quantities  for  time  and  states  of  the  system  Eqs.18.  Then 
the  non-dimensional  quantities  can  be  represented  as  the  ratio  of  the  actual  quantities  over  their  respective 


7  of  15 


American  Institute  of  Aeronautics  and  Astronautics 


reference;  for  example  t  etc.  Thus,  the  non-dimensionalized  equations  are  given  as 


xi 


x2 


x3 


£4 


^20^0 

- X2 

X 10 

—  (BxxoxIqXxxI  -  BGsin(x3ox3)) 
X20 

tox^o  „ 

- X4 

X30 

to 

- UoU 

X40 


(22a) 

(22b) 

(22c) 

(22d) 


Note  that  the  reference  quantities  (to,  £10,  ^20,  uo)  are  all  of  0(1),  whereas  (£30,^40)  are  of  0(0.1)  or  even  less 
as  they  represent  angular  quantities  in  radians.  So  it  can  be  seen  that  the  evolution  of  the  states  (4q,  x2,  x3) 
is  of  0(1)  whereas  the  evolution  of  the  state  X4  is  of  0(  1/0.1)  which  is  much  faster.  Therefore  it  is  concluded 
that  the  state  X4  evolves  at  a  much  faster  rate  when  compared  to  the  other  states  of  the  system. 

As  a  result  of  the  above  analysis,  a  desired  internal  state  trajectory  X4 d  is  computed  such  that  the  output 
asymptotically  tracks  the  desired  trajectory.  The  reduced  slow  system  for  this  example  is  given  by  the 
following  equations 


y  = 

xi 

(23a) 

y  = 

x2 

(23b) 

y  = 

B(x  1X4  —  Gsinx  3) 

(23c) 

y  = 

B(x\x  2  —  GX4COSX3  +  2x1X4  u) 

(23d) 

0  = 

u 

(23e) 

which  simplifies  to 

y 

=  Xi 

(24a) 

y 

=  X2 

(24b) 

y 

=  B(x \x\d  —  Gsinx3) 

(24c) 

y 

=  B(x ldx2  -  Gx4dCOSX3) 

(24d) 

Let  the  desired  output  dynamics  v  =  y 

d  +  OL2Vd  +  «i yd  +  «o yd  -  OL2 y  - 

aiy  —  onyij  where  a,  are  positive 

constants.  Note  that  this  choice  of  v  ensures  that  the  output  exponentially  converges  to  the  desired  trajectory. 
Rearranging  v  using  Eqs.24b-24d  gives 

V  =  B(qL2Xi  +  X2)x\d  —  BGx4dCOSX3  (25) 

(3) 

where  v  =  yKd  }  +  a2yd  +  aq yd  +  <ao yd  +  a2BGsinx3  —  ol\x2  —  &oy,  which  is  quadratic  in  the  internal  state 
X4d.  Using  the  procedure  proposed  in  Reference.  21  the  desired  internal  state  is  computed  as 


X4  d 


BGCOSX3  —  ( BG  COS  X3)2  +  4B(ct2Xi  +  X2)v 

2B(a2xi  +  x2) 


(26) 


In  order  to  enforce  the  condition  that  the  internal  state  follows  the  desired  X4d,  the  control  variable  u  is 
designed  such  that  the  fast  subsystem 


x'4  =  u 


(27) 


is  asymptotically  stable  about  X4d.  Since  the  control  appears  linearly,  proportional  control  is  chosen 


u  =  -/3(x4  -  x4d) 
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2.  Results  and  Discussion 


The  desired  trajectory  is  yd(t)  =  Acos(^f)  with  A  =  1,2.  The  constants  are  chosen  as  =  6,  aq  = 
12,  ao  =  8,  /?  =  8.  Note  that  these  constants  are  chosen  such  that  the  time-scale  behaviour  is  preserved  in 
the  closed- loop  system.  The  control  torque  tau  is  assumed  to  have  a  time  constant  of  0.05s  and  position 
limits  of  ±1.  Figures  1-5  present  the  simulation  results.  The  position  output  and  the  tracking  error  is  shown 
in  Figures  1-2.  Notice  that  after  the  transient  settles  out  perfect  position  tracking  is  achieved.  This  perfect 
output  tracking  indicates  that  the  internal  states  are  bounded  and  follow  their  desired  values  closely,  as  seen 
in  Figures  3-4.  The  error  between  the  desired  internal  state  x ^  and  the  actual  system  response  for  both 
the  cases  is  within  ±0.001.  The  control  input  required  to  accomplish  the  exact  output  tracking  is  shown  in 
Figure  5.  Notice  that  the  torque  computed  is  bounded  and  within  constrained  limits.  The  peaks  around  the 
first  few  seconds  are  due  to  the  arbitrarily  chosen  initial  conditions,  and  not  the  equilibrium  solution  for  the 
system. 


Figure  1.  Position  of  Ball,  Beam-Ball  Example 


Figure  2.  Tracking  Error,  Beam-Ball  Example 


Figure  3.  Roll  Angle, Beam-Ball  Example 


Figure  4.  Internal  State,  Beam-Ball  Example 


C.  Nonlinear  Aircraft  Tracking 

The  purpose  of  this  example  is  to  test  the  performance  of  the  proposed  controller  for  the  longitudinal  axis 
Conventional  Take-off  and  Landing  (CTOL)  Aircraft,  which  is  a  Douglas  DC-8.  The  aircraft  model  has 
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three  degrees-of- freedom:  horizontal  and  vertical  position  (x,  z),  and  pitch  attitude  angle  0.  The  objective  is 
to  control  the  translational  kinematics  while  stabilizing  the  unstable  rotational  dynamics.  The  two  available 
controls  are  thrust  iq,  and  pitching  moment  U2-  The  aircraft  model  is  described  by  the  following  first-order 
differential  equations: 


x 

u 

z 

w 

0 

Q 


u 

cos  0  (ui  —  D  cos  a  +  L  sin  a)  —  sin  0  (—Fy  +  D  sin  a  +  L  cos  a) 

m 

w 

sin  0  {u\  —  D  cos  a  +  L  sin  a)  +  cos  0  (~Fy  +  D  sin  a  +  L  cos  a) 

m 


Q 

U2 

J 


(28a) 

(28b) 

(28c) 

(28d) 

(28e) 

(28f) 


where  u,w  are  the  forward  and  vertical  velocities,  q  is  the  pitch  rate  and  a  =  0—ta,n~1  ^  is  the  angle-of-attack. 
The  aerodynamic  forces  and  physical  constants  are  chosen  and  given  asm  =  85000 kg,  J  =  4  *  10 6kgm2, 
g  =  9.81ms-2,  L  =  cll{u 2  +  w2)(  1  +  ca),  D  =  cld{u 2  +  w2)(  1  +  b(  1  +  ca)2)  with  Fy  =  cll  = 

aD  =  ^5  b  =  0.01,  and  c  =  6.  The  system  of  Eqs.28  is  given  in  the  desired  compact  normal  form.  The 
non-minimum  phase  characteristic  is  due  to  the  pitching  moment  inducing  a  parasitic  downward  force  onto 
the  system. 


1.  Controller  Design 

The  first  step  in  the  control  design  is  to  write  Eqs.28  in  singularly  perturbed  form.  To  determine  whether  the 
rotation  0  can  be  employed  as  a  control,  a  time-scale  analysis  similar  to  the  beam-ball  example  is  carried  out. 
Let  the  reference  quantities  be  denoted  as  (to,  xo,  zo,  uo,  wo,0o,  qo),  (uio,U2o),  and  (Do  =  Lq  =  u\ o  =  Fy o). 
The  non-dimensionalized  equations  are  given  as 


x 

u 


toUo 

Xo 

Toto 


muo 


(cos(0)(ui  —  Dcos(a)  +  Lsin(a))  —  sin (0)(—Fy  +  Dsm(a)  +  L  cos 


(29a) 

(29b) 
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Z  = 

W  = 

0  = 

Q  = 


tpw0 

z0 

to^o 


w 


(30a) 


mw  o 
to 
Qo 
to^20 


-00 


Jqo 


(sm(0)(ui  —  D  cos(a)  +  L  sin(a))  +  cos  (0)(—Fy  +  D  sin  (a)  +  L  cos(a))>)  —  —g  (30b) 
V  J  w  o 

q  (30c) 

u2  (30d) 


Let 


tpWQ 

Zo 


Lptp 

tpLp 

mup 

mwp 

=  1.  This  is  a  valid  assumption  as 


mu  o 

to 


also  has  units  of  force.  Similarly, 


tpUp 

Xp 


=  1. 


tjL0o 

qo  u 


=  1  and 


tpU2P 

Jqo 


=  -  is  a  very  large 


Next,  since  the  angular  quantities  are  small, 
quantity.  Thus,  it  can  be  concluded  that  the  rotational  dynamics  evolve  faster  and  the  pitch  rate  evolves 
faster  than  the  translational  velocities,  where  (x,  y,u,w,  6)  evolve  at  a  rate  of  0(1).  This  conclusion  permits 
the  assumption  of  pitch  attitude  angle  as  the  ‘pseudo-control’.  Thus  pitch  rate  is  the  control  input  for  the 
desired  pitch  attitude  angle. 

Let  eu  =  u  —  Ud  and  ew  =  w  —  Wd  denote  the  errors  between  the  actual  and  the  desired  output  and 
rewrite  the  system  of  Eqs.28  as 


x 

eu 

z 

0 


u 

cos  6  (u\  —  D  cos  a  +  L  sin  a)  —  sin  0  (—Fy  +  D  sin  a  +  L  cos  a) 

m 

w 

sin  9  (u\  —  D  cos  a  +  L  sin  a)  +  cos  6  {—Fy  +  D  sin  a  +  L  cos  a) 

m 


-  ud 


-g-wd 


q 

U2 

J 


(31a) 

(31b) 

(31c) 

(31d) 

(31e) 

(31f) 


where  e  is  introduced  to  signify  the  time  difference.  Let  6d  and  qd  indicate  the  desired  internal  states.  Thus, 
the  resulting  reduced  slow  subsystem  becomes 


x 

eu 

z 

&w 

6 

0 


u 

cos  6  (ui  —  D  cos  a  +  L  sin  a)  —  sin  6  (~Fy  +  D  sin  a  +  L  cos  a) 

ud 

m 

w 

sin  0  (u\  —  D  cos  a  +  L  sin  a)  +  cos  0  (—Fv  +  D  sin  a  +  L  cos  a) 

- y- - g  -  wd 

rn 


qj 

U2 

J 


that  further  simplifies  to 
x  =  u 

cosOd  (u\  —  Dcosa  +  Lsnm)  —  sin^  (Dsina  +  Lcosa) 
eu  =  t±d 

rn 

z  =  w 

sin  Od  (u\  —  D  cos  a  +  L  sin  a)  +  cos  6d  (D  sin  a  +  L  cos  a) 

ew  = - g  -wd 

m 

o  =  qd 


(32a) 

(32b) 

(32c) 

(32d) 

(32e) 

(32f) 


(33a) 

(33b) 

(33c) 

(33d) 

(33e) 
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Using  trigonometric  identities  and  rearranging 


x  =  u 

z  =  w 

u\  cos  Od  —  D  cos(6d  —  a)  —  L  sin(#  —  a) 

&u  =  'U'd 

rn 

u\  sin  0d~  D  sin  (Od  —  a)  +  L  cos  (Od  —  ol) 

e™  = - g  ~Wd 


m 


o  =  Qd 


(34a) 

(34b) 

(34c) 

(34d) 

(34e) 


In  order  to  force  the  errors  to  asymptotically  approach  the  origin,  design  the  desired  Od  and  thrust  u\  such 
that 


u\  cos  Od  —  D  cos(6d  —  ol)  —  L  sin(0  —  a) 

Al  =  'U'd 

m 

u\  sin  Od  —  D  sin  (Od  —  a)  +  L  cos  (Od  —  ol) 

-X2ew  = - g  -wd 

m 


(35) 

(36) 


Notice  that  Eqs. 35-36  are  independent  of  qd  and  hence  it  can  be  computed  as 


qd  =  -W0-0d )  (37) 

This  procedure  completes  the  design  of  the  controller  for  the  slow  subsystem.  The  fast  subsystem  controller 
now  needs  to  be  designed  such  that  the  fast  state  q  follows  qd  asymptotically.  This  can  be  achieved  by 
computing  the  required  moment  as 

u2  —  -A4J (q  (**  qd)  (38) 

In  the  equations  above  A  denotes  the  desired  closed- loop  characteristics. 


2.  Results  and  Discussion 

The  control  objective  is  to  perform  a  climbing  maneuver  that  tracks  a  constant  velocity.  The  forward 
velocity  is  commanded  to  be  constant  at  145ms-1  and  the  vertical  velocity  is  chosen  as  Wd  =  sin(g^). 
The  closed-loop  characteristics  are  chosen  such  that  the  time-scale  properties  are  preserved:  Ai  =  4,  A2  =  4, 
A3  =  4  and  A4  =  6.  The  actuators  are  assumed  to  have  first-order  dynamics  with  a  0.05s  time  constant. 
The  nonlinear  equations  Eqs. 35-36  were  solved  using  the  constrained  optimizer  f  solve  in  MATLAB  with 
arbitrarily  chosen  initial  conditions  of  (100,  0.05)  for  thrust  and  0d  respectively.  The  small  angle  assumption 
was  made  for  angle- of- attack  to  ease  the  computational  burden.  The  results  are  presented  in  Figures  6-12. 
Figures  6-7  compare  the  forward  and  vertical  velocities  to  their  respective  desired  references.  Close  tracking 
is  demonstrated  with  an  error  of  0.002ms-1  in  forward  velocity  and  ±0. 049ms-1  in  vertical  velocity.  The 
control  commands  are  presented  in  Figures  8-9.  Thrust  is  seen  to  settle  down  to  its  equilibrium  value  of 
3.694  x  1081V  while  the  moment  varies  accordingly  to  provide  sufficient  upward  force.  As  expected  the 
directions  of  the  vertical  velocity  and  the  applied  moment  are  opposite:  positive  moment  induces  a  negative 
downward  force  and  reduces  the  vertical  velocity  to  its  desired  value.  Therefore,  for  the  first  60  seconds  the 
moment  is  negative,  after  which  it  changes  sign.  Perfect  output  tracking  indicates  that  the  internal  aircraft 
states  are  stable.  This  behaviour  is  seen  in  Figure  10-11.  The  pitch  attitude  angle  (Figure  11)  is  bounded 
and  behaves  as  expected.  A  climb  produces  an  increase  in  pitch  attitude  angle,  and  a  descent  produces 
a  negative  value.  The  pitch  rate  behaviour  seen  in  Figure  10  agrees  with  the  commanded  trajectory.  In 
comparison  with  results  published  in  Reference  22,  this  exact  internal  state  trajectory  was  obtained  using  the 
offline  technique  proposed  by  Devasia  et.al.  The  complete  two-dimensional  trajectory  is  shown  in  Figure 
12. 


V.  Conclusions 

A  control  formulation  for  output  tracking  of  a  general  class  of  nonlinear  non-minimum  phase  systems 
was  developed.  The  desired  internal-state  reference  and  feedback  control  to  stabilize  the  unstable  internal 
dynamics  were  posed  as  an  asymptotic  slow  tracking  problem  for  singularly  perturbed  systems.  Controller 
performance  was  demonstrated  through  numerical  simulation  for  two  benchmark  nonlinear  examples. 
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Figure  6.  Forward  Velocity,  Aircraft  Example 


Figure  7.  Vertical  Velocity,  Aircraft  Example 


Figure  9.  Applied  Moment  (after  transient),  Aircraft 
Figure  8.  Applied  Thrust,  Aircraft  Example  Example 


Figure  10.  Pitch  Rate,  Aircraft  Example 


Figure  11.  Pitch  Attitude,  Aircraft  Example 
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Figure  12.  Two-Dimensional  Trajectory,  Aircraft  Example 


Based  on  the  results  presented  in  the  paper,  the  following  conclusions  are  drawn.  The  tracking  error  for 
the  beam-ball  example  was  demonstrated  to  remain  within  |0.03|  at  all  times,  and  perfect  output  tracking 
was  demonstrated.  This  perfect  output  tracking  was  a  result  of  perfect  internal  state  tracking  that  was 
achieved  by  the  nonlinear  feedback  law.  This  same  behaviour  was  also  seen  for  the  aircraft  example,  where 
the  tracking  error  was  within  1 0.002 1  for  the  forward  velocity  and  1 0.049 1  for  the  vertical  velocity.  For  both 
of  these  benchmark  problems,  the  controller  demonstrated  asymptotic  tracking  irrespective  of  the  desired 
reference  trajectory.  The  controller  was  causal  and  did  not  require  any  preview  of  the  desired  reference. 
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Tracking  Control  Design  for  Non-Standard  Nonlinear  Singularly 

Perturbed  Systems 

Anshu  Siddarth  and  John  Valasek 


Abstract —  Tracking  control  laws  for  a  general  class  of  nonlin¬ 
ear  singularly  perturbed  systems  are  developed.  No  assumptions 
concerning  the  nonlinearity  of  the  system  is  made.  The  effect 
of  the  different  speeds  of  controllers  and  nonlinear  actuator 
dynamics  is  studied  and  asymptotic  stabilization  is  shown 
using  Lyapunov  methods.  Design  procedure  and  performance 
of  the  proposed  technique  is  evaluated  against  composite 
control  method.  Results  indicate  that  the  proposed  technique 
applies  both  to  standard  and  non-standard  forms  of  singularly 
perturbed  systems. 

I.  INTRODUCTION 

Analysis  and  control  of  singularly  perturbed  systems  has 
received  considerable  attention  in  literature  [1].  The  common 
approach  is  to  design  two  separate  controllers  for  each  of  the 
two  lower-order  subsystems  and  then  apply  their  composite 
or  sum  to  the  full-order  system.  The  composite  control 
technique  [2]  guarantees  asymptotic  stability  for  standard 
singularly  perturbed  systems  or  for  systems  wherein  the 
algebraic  problem  has  a  unique  root  for  the  fast  variables  in 
the  region  of  interest.  In  literature  this  assumption  is  satisfied 
by  either  assuming  that  a  unique  root  for  the  fast  states  exists 
[3]  or  assuming  that  the  system  dynamics  is  nonlinear  only 
in  the  slow  states  [4].  However,  this  root  is  a  set  of  fixed 
points  of  the  fast  dynamics  expressed  as  a  smooth  function 
of  the  slow  variables  and  the  control  inputs,  and  hence  is 
not  always  unique  nor  guaranteed  to  exist.  Consequently 
one  is  required  to  choose  an  isolated  manifold  in  order  to 
design  a  stabilizing  control  structure  for  the  slow  subsystem. 
This  not  only  requires  substantial  system  knowledge  but 
also  restricts  the  results  to  a  local  domain.  Furthermore, 
analytical  determination  of  this  manifold  is  restricted  by 
the  nonlinearity  of  the  system.  In  such  cases,  it  has  been 
shown  that  only  ultimate  boundedness  of  the  signals  maybe 
concluded  [5]. 

This  paper  proposes  an  alternate  approach  for  control 
design  of  non-standard  forms  of  singularly  perturbed  sys¬ 
tems.  The  proposed  approach  avoids  analytical  computation 
of  the  manifold  by  considering  it  as  an  additional  control 
variable.  This  idea  is  motivated  by  singularly  perturbed  sys¬ 
tems  such  as  aircraft  wherein  the  fast  states  appear  linearly 
in  the  slow  dynamics.  Reference  [6]  successfully  designed 
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nonlinear  flight  trajectories  using  angular  rates  as  control 
variables,  although,  the  effect  of  control  variables  on  the  slow 
variables  was  neglected.  More  recently  ultimate  uniformly 
bounded  results  were  concluded  [7]  using  similar  ideas  while 
assuming  that  the  set  of  nonlinear  algebraic  equations  can  be 
solved  for  the  control  variables  and  the  fast  controller  was 
designed  using  gain- scheduling. 

This  paper  makes  three  major  contributions.  First,  the  ap¬ 
proach  developed  here  employs  the  reduced-order  technique 
without  imposing  any  assumptions  about  the  solutions  of  the 
transcendental  equations  or  the  effect  of  the  control  variables. 
By  computing  the  slow  manifold  upon  which  the  fast  states 
must  be  restricted  for  asymptotic  tracking  and  ensuring  that 
this  manifold  is  the  equilibrium  of  the  system  uniformly, 
control  objective  is  accomplished.  Second,  controllers  with 
different  speeds  are  addressed  in  comparison  to  composite 
control  technique  that  requires  all  control  variables  to  be 
sufficiently  fast.  Third,  the  control  laws  are  computed  using 
Lyapunov-based  designs  that  are  able  to  capture  the  nonlinear 
behaviour  that  is  lost  in  the  linearization  of  the  system. 
Owing  to  this,  the  global  or  local  nature  of  results  are 
relaxed  from  the  complexities  of  analytic  construction  of 
the  manifold  and  are  entirely  a  consequence  of  the  choice 
of  underlying  controllers  for  the  reduced-order  models. 
Additionally,  the  control  laws  developed  in  this  paper  are 
independent  of  the  singular  perturbation  parameter  and  an 
upper  bound  for  the  scalar  perturbation  parameter  is  derived 
as  a  sufficiency  condition  for  asymptotic  stability. 

The  paper  is  organized  as  follows.  Section  II  mathemat¬ 
ically  formulates  the  control  problem  and  briefly  reviews 
the  necessary  concepts  from  geometric  singular  perturbation 
theory.  Control  laws  and  the  main  results  of  the  paper  are 
detailed  in  Section  III.  Section  IV  studies  several  numerical 
examples  and  qualitatively  analyses  the  performance  and 
design  procedure  of  the  proposed  technique.  Conclusions  are 
discussed  in  Section  V. 

II.  PROBLEM  DESCRIPTION  AND  MODEL 
REDUCTION 

A.  System  Description 

The  class  of  nonlinear  singularly  perturbed  dynamical 
systems  addressed  in  this  paper  are 


x  =  f  (x,  z,  6); 

xer^er 

(la) 

h  = 

$1  G  M*,  lli  C  M/ 

(lb) 

ez  =  g(x,z  ,S,e); 

zer 

(1c) 

e<^2  =  fa2(^2,u2); 

S2  ElH,u2  eRH 

(Id) 
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where  x  is  the  vector  of  slow  variables,  z  is  the  vector  of  fast 
variables,  S  =  is  the  vector  of  actuator  commands 

input  to  the  system,  u  =  [ui,  u2]T  G  W  is  the  input  vector 
that  is  to  be  computed  and  e  G  Mis  the  singular  perturbation 
parameter  that  satisfies  0  <  e  <<  1  and  is  unknown.  All  the 
vector  fields  are  assumed  to  be  sufficiently  smooth  and  p  > 
m.  The  control  objective  is  to  drive  the  slow  state  so  as  to 
track  sufficiently  smooth,  bounded,  time- varying  trajectories 
or,  x(t)  xr(t)  as  t  oc. 

The  controls  have  been  separated  into  vectors  Si  and  S2 
to  consider  the  different  speeds  of  the  control  variables,  with 
Si  representing  the  actuators  with  slow  dynamics  and  S2 
representing  actuators  with  relatively  fast  actuator  dynamics. 
The  vector  fields  f(51(.)  and  f$2(.)  represent  their  actuator 
dynamics  respectively.  The  model  given  in  (1)  represents  the 
special  case  of  two  time- scale  dynamical  systems.  The  design 
procedure  developed  here  also  applies  to  multiple  time- scale 
systems  of  the  following  form 

x  =  f(x,  z,5) 
tih  =  fijO^Ui) 
e2z  =  g(x,  z,  8,  e2) 

£3^2  =  f(52(c>2,  U2) 

where  ei,  e2  and  63  are  singular  perturbation  parameters  of 
different  orders  that  satisfy  63  <  e2  <  ei. 

B.  Reduced-Order  Models 

The  system  considered  in  (1)  is  labeled  the  Slow  System 
and  the  independent  variable  t  is  called  the  slow  time- scale. 
This  system  is  equivalently  written  as 


x' 

=  ef  (x,  z,  6) 

(2a) 

V 

=  efg1  (<5i,Ui) 

(2b) 

t! 

=  g(x,  z,  <5,  e) 

(2c) 

=  f52(^2,U2) 

(2d) 

where  '  represents  derivative  with  respect  to  r  = 

t—t0 

e 

and  to  is  the  initial  time.  Equation  (2)  are  labeled  the 
Fast  System  and  the  independent  variable  r  is  called  the 
fast  time-scale.  Geometric  singular  perturbation  theory[8] 
examines  the  behaviour  of  these  singularly  perturbed  systems 
by  studying  the  geometric  constructs  of  the  reduced-order 
models  which  are  obtained  by  substituting  e  =  0  in  (1)  and 
(2).  This  results  in: 


Reduced  Slow  Subsystem: 


x  =  f(x,  z,6) 

(3a) 

h  =  f,v,  (>>j.  in) 

(3b) 

0  =  g(x,  z,  5, 0) 

(3c) 

0  =  f<52(52,  u2) 

(3d) 

Reduced  Fast  Subsystem: 

II 

o 

Ov 

M 

II 

o 

(4a) 

z'  =  g(x,  z,<5, 0) 

(4b) 

M 

S 

M 

cw 

II 

^  M 

(4c) 

The  dynamics  of  the  resulting  reduced  slow  subsystem  are 
restricted  to  m  +  l  dimensions,  constrained  to  lie  upon  an 
n  +  p  —  l  dimensional  smooth  manifold  defined  by  the  set  of 
points  (x,  z,  S)  G  Mm  x  Mn  x  W  that  satisfy  the  algebraic 
equations  (3c), (3d): 

M0  '■  z  =  z(x, <5i, c52);  S2  =  <52(u2)  (5) 

This  set  of  points  is  identically  the  fixed  points  of  the  reduced 
fast  subsystem  (4b)-(4c).  Thus  the  manifold  Mo  is  invariant 
[9].  Furthermore,  the  flow  on  this  manifold  is  described  by 
the  differential  equations 

X  =  f(x,z(x,<5i,<52),<5i,<52(u2))  (6a) 

<*i  =  f^(<5i,ui)  (6b) 

if  the  reduced  fast  subsystem  is  stable  about  the  manifold 
Mo-  If  the  dynamics  of  (6)  are  locally  asymptotically 
stable  about  the  manifold,  then  it  can  be  concluded  that  the 
complete  system  (1)  is  also  locally  asymptotically  stable  [9]. 

III.  CONTROL  FORMULATION  AND  STABILITY 
ANALYSIS 

Stability  properties  of  the  slow  system  depend  upon  the 
identification  of  the  manifold  Mo-  In  general,  the  nonlinear 
set  of  algebraic  equations  (3c), (3d)  possess  multiple  roots 
and  the  manifold  Mo  may  take  any  of  these  values;  hence 
it  is  not  unique.  One  approach  to  ensure  uniqueness  is  to 
consider  the  fast  state  as  another  control  variable.  These  ideas 
are  mathematically  formulated  and  analyzed  in  this  section. 

A.  Control  Design 

The  first  step  is  to  transform  the  problem  into  a  non- 
autonomous  stabilization  control  problem.  Define  the  track¬ 
ing  error  signal  as  e(t)  =  x(t)  —  xr(t)  and  express  the  slow 


system  as 

e  =  F(e,  z,  xr,  xr,  S)  (7a) 

Si  =  fSl(Si,  UX)  (7b) 

ez  =  G(e,  z,  xr,  5,  e)  (7c) 

eS2  =  f<52feu2)  (7d) 

where  F(e,  z,  xr,  xr,  S)  =  f(e  +  xr,z ,S)  —  xr  and 


G(e,  z,  xr,S)  =  g(e  +  xr ,  z,  S,  e)  are  Lipschitz  on  a  domain 
of  the  state-space.  Using  the  procedure  described  in  Section 
II,  the  reduced  slow  subsystem  for  set  of  equations  in  (7)  is 
obtained  as 


e 

=  F(e,  z,  xr,  xr,  (5) 

(8a) 

Si 

=  f^((5i,ui) 

(8b) 

0 

=  G(e,  z,xr,<5, 0) 

(8c) 

0 

=  fs2(S2,u2) 

(8d) 

In  order  to  ensure  e  =  0  is  an  asymptotically  stable 
equilibrium  of  the  reduced  slow  system  (8)  define  a  positive- 
definite  and  decrescent  Lyapunov  function  that  satisfies 
Condition  1.  V(t,e)  :  [0,  oc)  x  Dx  R  is  continuously 
differentiable  and  Dx  C  Mm  contains  the  origin,  such  that 

o  <  V’i(IMI)  <  V(t,e)  <  V>2(||e||) 


221 


for  some  class  /C  functions  AO)  and  AO)- 

Let  S2r  represent  the  manifold  of  the  equation  (8d)  that  is 
defined  shortly.  Design  a  manifold  z  =  zr(e,  xr,  xr,  £2r)  and 
control  Si  =  ^ir(e,  xr,  xr,  S2 r)  such  that  the  slow  state  error 
system  (8a)  satisfies 

Condition  2. 

^  +  ^“F(e>  Zr>  Xr,  Xr,  <^lr,  ^2r)  <  -ttlV’K6)’  >  0 

where  AO)  is  a  continuous  positive-definite  scalar  function 
that  satisfies  A(O)  =  0. 

The  next  step  is  to  design  control  ui  that  ensures  the 
actuator  state  asymptotically  approaches  Sir.  Define  the  error 
in  actuator  state  as  :=  Si  —  Sir  and  rewrite  the  reduced 
slow  error  subsystem  (8a), (8b)  as 


The  complete  system  will  have  the  properties  of  the  reduced 
slow  subsystem  if  the  fast  state  asymptotically  stabilizes 
about  zr.  This  condition  is  enforced  by  designing  the  mani¬ 
fold  S2r.  Define  the  error  in  the  fast  state  vector  ez  :=  z  —  zr 
and  rewrite  (12b)  as 

ez  =  G(e,  ez,  xr ,  ,  S2r ,  0)  (13) 

while  noting  that  zr'  =  ezr  =  0  for  the  reduced  fast  sub¬ 
system.  Define  a  positive-definite  and  decrescent  Lyapunov 
function  that  satisfies 

Condition  4.  W (£,  e,  ,  ez)  :  [0,  oc)  x  Dx  x  Ds1  xDz-^M 
is  continuously  differentiable  and  Dz  C  Mn  contains  the 
origin,  such  that 

o  <  <MI|ez||)  <  W(t,e,eSl,e z)  <  02(||ez||) 


e  =  F(e,zr,xr,xr,(5ir,(52r)+  (9a) 

F(e,  Zr,  Xr,  Xr,  ,  t)2r)  F(g,  Zr,  Xr,  Xr,  ()lr,  $2r) 

e<5!  =  fii^i.ui)  -  5lr  (9b) 


where  Sir,  the  derivative  of  the  manifold  is  given  as 


•  d5lr  .  d5lr  .  d5lr ..  d5lr  • 

Sir  =  ^e+rt  +  ^Xr+^r 


de 

(Mir 

de 


<9xr 


<9xr 


di- 


VO) 


2r 


F(e,  Zr,  Xr,  Xr,  Oi,  o2r)  +  — — Xr  +  "m  xr 


9xr 


9xr 


using  (9a)  and  the  fact  that  S2r  is  a  fixed  point  of  the  reduced 
slow  subsystem  as  it  satisfies  equation  (8d).  Conditions 
1  —  2  ensure  that  the  slow  error  is  asymptotically  stabilized 
by  the  slow  actuator  variable  Sir.  In  order  to  ensure  the 
system  remains  asymptotically  stable  when  es1  ^  0,  define 
a  combined  positive-definite  decrescent  Lyapunov  function 
for  equations  (9a), (9b)  such  that  Vs(t,  e,  e^)  :  [0,  oo)  x 
Ac  x  D§1  M  is  continuously  differentiable  and  D§1  C  M1 
contains  the  origin 


Vs(t,e,eSl)  =  V(t,e)  +  ~eSlTeSl 


(11) 


and  design  ui  such  that  the  closed-loop  reduced  slow  system 
(9)  satisfies 

Condition  3. 


dV s_ 
dt 


dVs  .  ,  dVs  . 


a2  >  0 


where  AO)  is  a  continuous  positive-definite  scalar  function 
that  satisfies  A(0)  =0. 

Conditions  1  —  3  complete  the  design  of  control  for 
the  reduced  slow  subsystem.  Notice  that  the  manifold 
zr(e,  xr,  xr,  S2r)  computed  in  the  above  control  design  is 
a  function  of  the  manifold  S2r  which  is  unknown.  From 
the  discussion  detailed  in  Section  II,  it  is  known  that  this 
manifold  is  a  fixed  point  of  the  reduced  fast  subsystem, 


e'  =  0;  Si  =  0  (12a) 

zl  =  G(e,  z,  xr,  S,  0)  (12b) 

St!  =  f<52feu2)  (12c) 


for  some  class  /C  functions  AO)  and  AO)* 

and  design  S2v  such  that  the  closed-loop  reduced  fast  system 

(13)  satisfies 

Condition  5. 

G(e,  ez,  xr,  <5i,  <52r,  0)  <  -a30 §(ez),  a3  >  0 

where  AO)  is  a  continuous  positive-definite  scalar  function 
that  satisfies  03  (0)  =  0. 

Thus,  the  last  step  in  the  design  procedure  is  to  enforce 
that  the  fast  actuators  asymptotically  stabilize  about  S2r  and 
the  closed-loop  reduced  fast  subsystem  is  uniformly  stable. 
Define  the  error  in  the  fast  actuator  states  e$2  :=  S2  —  S2r 
and  rewrite  the  closed-loop  reduced  fast  subsystem  in  the 
error  coordinates 

ez/  =  G(e,  ez,  xr,  <5i,  <52r,  0)+  (14a) 

G(e,ez,xr,5i,(52,0)  -  G(e,  ez,  xr,  <5i,  <52r,  0) 
eS2'  =  f«2(<52,u2)  -  S2r'  (14b) 

using  the  fact  that  the  slow  variables  remain  constant  in  the 
fast  time  scale  and 


x  /  _  dS2r  ^ 

Ar  —  TT 

oez 


(15) 


Define  a  positive-definite  decrescent  combined  Lyapunov 
function  Wf(t,  e,  e$1?  ez,  e$2)  :  [0,  oo)  x  Ac  x  Ai  x  Dz  x 
Ds2  M  for  the  reduced  fast  subsystem  (14)  that  is 
continuously  differentiable  and  D$2  C  Rp~l  contains  the 
origin 


Wf(t,e,eSl,ex,eg3)  =  W(t,e,eSl,ex)  +  -egje^  (16) 

Design  u2  such  that  the  closed-loop  reduced  fast  system  (14) 
satisfies 

Condition  6. 

dWf  ,  dWf  ,  /2,  N  /2,  x 

~deez  +  ~des~e62  ~  ~  ^404(e^)5  a4  >  0 
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B.  Stability  Analysis 

The  following  theorem  summarizes  the  main  result  of  the 
paper. 

Theorem  1.  Suppose  the  control  u  of  the  system  (1)  is 
designed  according  to  the  Conditions  1  —  14.  Then  for  all 
initial  conditions,  (e,  e§1 ,  ez,  e$2)  G  Dx  x  Ds1  x  Dz  x 
the  control  uniformly  asymptotically  stabilizes  the  nonlinear 
singularly  perturbed  system  (1)  and  equivalently  drives  the 
slow  state  x(t)  xr(t)  for  all  e  <  e*,  that  is  defined  by  the 
inequality  given  in  (24). 

Proof:  The  closed-loop  complete  system  in  the  error 
coordinates  is  given  as 


e  =  F(e,  ez  +  zr ,  xr,  xr,  e<5i  +  Sir,  e$2  +  ^2r)  (17a) 

=  fSl  (e<5i  +  Slr,  Ui)  -  SlrC  (17b) 

eez  =  G(e,  ez  -T  zr,  xr,  g^  -T  (iiD  g §2  4-  S2r , 

—  ezr  (17c) 

ees2  =  U-2  (e<s2  +  S2r ,  u2)  -  eS2rc  (17d) 


where  the  subscript  ‘C’  is  added  to  note  the  difference 
between  (17b)-(17d)  and  (10)  and  (15).  These  expressions 
for  the  complete  system  are  noted  as 


JlrC  = 


J2rC  = 


dS ir  OS ir  . 

+  — — e  + 


dt 

85ir 

dxr 

dS2r 

dt 

dS2r 

de6l 


de 

dSlr ..  dSlr 
xr  +  -77^  xr  +  — — ez 


dxr 
.  dS2 r  . 


dez 

ds2 r  .  3J2r  .. 

-Xr  +  "777  Xr 


9xr 


dxr 


.  . 

e(5i  +  — ez 

dez 


(18) 


(19) 


Closed-loop  system  stability  of  the  system  states  is  analyzed 
using  the  composite  Lyapunov  function  approach[10].  Con¬ 
sider  a  Lyapunov  function  candidate 


is(t,e,eSl,ez,es2)  =  Vs(t,e,eSl)  bb  Wf(t,e,es1,ez,es2) 

(20) 

for  the  complete  closed-loop  system.  From  the  properties  of 
Vs  and  Wf  it  follows  that  z /(£,  e,  e^,  ez,  e<52)  is  positive- 
definite  and  decrescent.  The  derivative  of  v  along  the  trajec¬ 
tories  of  (17)  is  given  by 


dV8  dVs  _ 

dt  de  6  des1 


dVs  .  dWf  dWf 
'  e6l  +  +  -^e 


gW) 


iaw> 

e  <9e. 


dt 
ldWf 

e  des2 


de 


eS2 


(21) 


Note  that  the  vector  fields  in  (17a)  and  (17c)  can  also  be 
expressed  as 


F(g,  ez  Zr,  xr,  Xr,  G^  $ir,  G(52  ~b  $2r)  — 

F(e,  Zr,  Xr,  Xr,  £ir,  ^2r)  +  F(e,  Zr,  Xr,  Xr,  G^  +  $ir,  $2r) 
F(e,  zr,  xr,  xr,  $ir,  ^2r)  F(g,  zr,  xr,  xr,  ®d'i  “1“  ^ln  ^2r) 
+  F(e  ,  zr  ,  xr  ,  xr  ,  e<5i  +  ^lr ,  ^82  +  ^2r)  (22) 

+  F(e,  ez  zr,  xr,  xr,  g^  -1-  $ir,  e§2  -1-  $2r) 

F(e,  zr,  xr,  xr,  G<5i  Hb  ^lr,  e<52  +  ^2r) 


G(e,  ez  -b  zr,  xr,  g^  -b  $ir,  g^  -b  $217  — 

G(g,  Gz  +  Zr,  Xr,  G^  ~b  ^2r^  0) 

+  G(g,  Gz  zr,  Xr,  G^  +  (^ir,  G(52  +  $217  0) 

G(g,  Gz  ~b  Zr,  Xr,  e<5i  +  ^lr  ?  ^2r  ?  0) 

+  G(g,  gz  +  zr,  xr,  g^  +  ^ir,  e§2  +  S2r,  e) 

G(g,  Gz  zr,  Xr,  G^.,^  “b  ^(52  4“  ^2r?  0)  (23) 


Suppose  that  Lyapunov  functions  Vs  and  Wf  also  satisfy  the 
following  conditions  with  fy  >  0  and  7i  >  0 

Condition  7. 

dV.  / 

F  (^G,  Zr,  Xr,  Xr,  G^  H~  G(52  4"  02r) 

de 

d\2 

-^F(e,zr,xr,xr,e51  +^ir,<52r)  < 

Condition  8. 

dVs  . 

F(G,  Gz  +  Zr,  Xr,  Xr,  G^.,^  +  0ir,  e$2  +  02r) 

dVs 

-^F(e,z r,xr,xr,e5l  +<5ir,e52  +  <52r)  <  Mz^^eZ) 

Condition  9. 


3g^1  de. 

+  7lV’I(e51) 

Condition  10. 

aw> 


ez  <  /?3^3(e)^4(e51)  +  z) 


dez 

dWf 


G(e,  ez  zr,  xr,  xr,  -l-  (5ir,  g,52  -f-  (52r,  c) 

G(e,  ez  +  ?  Xr ,  Xr,  ®ii  +  <5lr)  ®<52  +  ^2d  0) 


dez 

<  e72</>i(ez)  +  e^5^3(e)03(ez)  + 
+  e/374>3(ez)(t>4(eS2) 

Condition  11. 


0W) 


+ 


aw>  _  dWf  dzr 
de  dez  de 


dWf  dWf  dzr 


Lae5l  dez  ae5lJ 


e^i 


dWf  dzr  .  dWfdzr..  ^  0  ,  .  ,,  .  , 

-  &:x'  -  s;51' £  ',3^<e-> + A'ts(e)*(e-) 

+  P9'ip4(es1)(pz{ez) 

Condition  12. 


/ 


9e52 


a<y. 


2r 


a<52r .  a<52r . 

e  +  — — xr 


dS2r .. 

-Xr 


dS2r  . 

^<5i 


9g  9xr  dxr  des1 

74</>4(e<52)  +  /3io^3(e)04(e«2)  +  /?nV;4(e<51)04(e32) 

Condition  13. 

9W>  dS2 r 


< 


96(52  9g; 

aw>  ds2r 


G(g,  gz  +  zr,  xr,  xr,  G<5i  +  ^lr,  e^2  +  ^2r,  e) 
G(g,  Gz  +  Zr,  Xr,  Xr,  G(51  4"  ^ir7  G(52  ~b  ^2r?  0) 


9e(52  9e. 

<  £75 4>l(e52)  +  ePu'ipz (e)<Me.52)  +  ^i3^4(eSl)<f>4(es2) 

+  e^i403(ez)</>4  (e52) 

Condition  14. 

aw>  a<52r 


5e52  5ez 


:zr]  <  e/3i5^3(e)</>4(e<52)  +  e/316ip4(eSl  )<^4(e<s2) 
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Conditions  9  —  14  enforce  restrictions  upon  the  difference 
between  the  complete  system  and  the  reduced  subsystems. 
Use  Conditions  1  —  14  into  (21)  and  rearrange  to  get, 


z>  < 


(24) 


where  4/  = 


03 

04 

03 

04 


and  matrix  K  given  in  (24)  is 


positive-definite  for  e  <  e*.  By  definition  of  the  continuous 
scalar  functions  03,04,  0 3  and  04,  it  follows  that  z>  is 
negative  definite.  By  Lyapunov  theorem  it  is  concluded 
that  (e,  0l,  z,  £2)  =  (0,  ^ir,  zr(0,  xr,  xr),  (i2r)  is  uniformly 
asymptotic  stable  equilibrium  of  the  closed-loop  system  (17). 
Further,  from  definition  of  the  tracking  error,  it  is  concluded 
that  x(t)  xr(t)  asymptotically.  Since  the  desired  trajec¬ 
tory  is  assumed  to  be  smooth  and  bounded  with  bounded 
first-order  derivatives  all  the  other  signals  remain  bounded 
for  all  time.  ■ 


IV.  NUMERICAL  EXAMPLES 
A.  Purpose  and  Scope 

The  purpose  of  this  section  is  to  illustrate  the  preced¬ 
ing  theoretical  developments  and  demonstrate  the  controller 
performance  for  both  standard  and  non-standard  forms  of 
singularly  perturbed  systems.  The  first  example  is  taken  from 
Reference  [2]  and  the  purpose  is  to  see  how  the  proposed 
approach  compares  with  composite  control  technique  for 
standard  singularly  perturbed  systems.  The  objective  of  the 
second  example  is  to  analyze  the  performance  and  robustness 
characteristics  of  the  controller  for  non-standard  forms  of 
singularly  perturbed  systems. 

Example  1:  Standard  Singularly  Perturbed  Model 

The  following  example  is  taken  from  Reference  [2].  The 
objective  is  to  design  a  regulator  to  stabilize  both  the  slow 
and  the  fast  state  in  the  domain  Dx  G  [—1,1]  and  Dz  = 

[-1/24/2]. 

x  =  xz3;  ez  =  z  +  u  (25) 

The  reduced-order  models  for  the  system  under  study  are 


Reduced  Slow  Subsystem 

x  =  xz3\  0  =  z  +  u 

(26) 

Reduced  East  Subsystem 

x'  =  0;  z'  =  z  +  u 

(27) 

Notice  that  the  algebraic  equation  in  the  reduced  slow 
subsystem  has  an  isolated  root  for  the  fast  state;  thus  the 
system  given  is  in  standard  form. 

The  controller  is  designed  using  the  same  Lyapunov 
functions  and  closed-loop  characteristics  as  in  [2].  Using 
V(x)  =  1 x 6  as  Lyapunov  function  for  the  slow  subsystem, 
the  desired  manifold  zr  =  —  X3  satisfies  Condition  2  with 
ai  =  1  and  03 (x)  =  \x\ 5.  The  control  is  designed  as 
u  =  —  3z  —  2x3  to  satisfy  Condition  5  with  Lyapunov 


function  W  =  \{z  —  zr )2,  <43  =  2  and  03 (x,z)  =  \z  —  zr |. 

The  closed-loop  system  with  ez  =  z  —  zr  becomes 

x  =  x(ez  +  zr)3  (28a) 

eez  =  -2ez  +  (ez  +  zr )3  (28b) 

The  inequality  in  (24)  is  satisfied  for  all  e  <  0.4246. 

Notice  that  the  control  law  designed  is  exactly  same  as 
that  obtained  using  composite  control.  However,  the  upper- 
bound  is  conservative  when  compared  to  the  upper-bound 
obtained  using  composite  control  (0.4286).  This  variation 
appears  because  the  coefficients  of  the  composite  Lyapunov 
function  were  chosen  as  unity  in  (20)  instead  of  optimal 
values  as  in  composite  control. 

Example  2:  Non-Standard  Singularly  Perturbed  Model 
Consider  the  following  unstable  linear  system 

x  =  z  —  u]  ez  =  x  +  u  (29) 


The  objective  is  to  stabilize  the  system  about  x  =  0  and 
z  =  0.  Notice  that  the  algebraic  equation  obtained  by  setting 
e  =  0  has  infinitely  many  solutions  and  composite  control 
cannot  be  applied. 

Control  Design:  With  V(x)  =  \x 2  as  Lyapunov  function 
for  the  reduced  slow  subsystem,  manifold  zr  =  u  —  ol\x 
satisfies  Condition  2  for  03 (x)  =  \x\.  Lyapunov  function  for 
the  fast  subsystem  is  W(x,z)  =  \(z  —  zr)2 .  Condition  5  is 
satisfied  with  control  u  =  —  x  —  a>2  (z  —  zr )  and  03  (x,  z)  = 
\z  —  zr\.  The  applied  control  in  original  system  coordinates 
is  given  as 


u  = 


—  1  —  Oi\Oi2 
1—0(2 


X  — 


Oi2 

1—0(2 


(30) 


Substituting  (30)  in  (29)  and  with  a  change  of  coordinates 
gives  the  closed-loop  system 


X  =  —OL\X  + 


0(2 


1—0(2 


1-0(2 
cz  +  e 


1  +  0(1 
1-0(2 


ez  —  0(1  (1  +  0(10 


(31a) 


(31b) 


where  ez  —  z  —  zr.  The  constants  satisfying  Conditions  7-14 
are  02  =  /?5  =  -<*i(l  +  ai)  and  72  =  rest  all 

being  zeros. 

Results  and  Discussion:  The  closed-loop  gains  chosen 
are  a\  =  0.5  and  0:2  =  0.5.  With  these  values,  global 
asymptotic  stabilization  is  satisfied  for  all  e  <  0.2645.  Tabled 
documents  the  closed-loop  eigenvalues  for  different  values 
of  e.  The  closed-loop  system  loses  its  time-scale  property 
for  e  >  0.2645  but  remains  stable  with  complex  conjugate 
eigenvalues  indicating  that  the  upper  bound  e*  satisfying 
condition  (24)  is  conservative.  The  system  becomes  unstable 
for  all  e  >  0.4. 

The  system  given  in  (29)  is  the  linearized  model  of  the 
nonlinear  non-standard  form  [11] 


x  =  tan  z  —  u;  ez  =  x  +  u  (32) 

Notice  that  the  fast  state  appears  nonlinearly  in  the  slow 
dynamics  and  hence  determining  a  manifold  zr  to  meet  the 
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(24) 


a\ 

_§3 


—  \  [/?2  +  k  +  /?6  +  k\ 

_  —  |  [Pi  —  /3io  —  fii2  —  (3 15] 


£3 

2 


«2  -7l 

—  |  [/?4  +  &7  +  /?g] 

|  [Pll  +  A.3  +  /?16] 


—  |  [/?2  +  /?5  +  /?6  +  /%] 

“  ^  [/?4  +  P7  +  /3g] 

V  “  72  -  73 

ffl4 

2 


—  \  [01  —  /3lO  ~  P12  ~  Plb\ 

\  [Pu  +  (3 13  +  Piq\ 

Pl4 

+1 4+75 


TABLE  I 

Example  2:  Closed-loop  Eigenvalues 


e 

Eigenvalues  A 

0.05 

Ax  =  -0.5914,  A2  =  -16.9086 

0.1 

Ax  =  -0.7396,  A2  =  -6.7604 

0.2645 

Aij2  =  —0.6404  =b  1.167.7 

0.35 

Ai,2  =  —0.1786  ±  I.I8I8.7 

0.4 

Ai,2  =  0.000  ±  1.1180j 

0.405 

Ax,2  =  0.0154  ±  l.lllOj 

x  1.5 

I  A 
;-L 


«  0 

(/)  0 

2 

4 

6 

8  10 

"  i 

1  jr 
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W  -4' 
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Time(sec) 


Fig.  1.  Example  2:  Nonlinear  System  (32)  Closed-loop  Response  (e  =  0.1) 


control  objective  is  difficult.  Instead,  use  the  controller  (30) 
that  was  developed  for  the  linear  counterpart.  The  resulting 
closed-loop  system  with  a\  =  ct2  =0.5  is 

x  =  2.5x  +  tanz  +  z\  ez  =  —  l.bx  —  z  (33) 

The  controller  converts  the  non-standard  form  into  standard 
form  which  uniquely  restricts  the  system  onto  the  desired 
manifold,  which  in  this  case  is  =  —  1.5x.  It  is  clear 
that  due  to  the  nonlinear  nature  of  the  problem  the  domain 
of  attraction  is  now  restricted  to  a  subspace  of  the  two- 
dimensional  Euclidean  space.  Using  the  previously  outlined 
procedure,  constants  satisfying  Conditions  7  —  14  are  /^  =  2 
and  72  =  3  with  all  others  being  zero  in  the  domain  Dx  £ 
[0,-1)  and  Dz  £  [—1,2].  The  upper-bound  on  singular 
perturbation  parameter  is  computed  as  e*  =  0.2.  Simulation 
study  in  this  case  indicates  that  stability  is  maintained  for  all 
e  <  0.4  and  the  nonlinear  system  is  asymptotically  stabilized 
in  the  domain  Dx  £  [—2,2]  and  Dz  £  [—1.5,2].  Simulation 
results  for  the  case  of  e  =  0.1  are  shown  in  Fig.l.  Notice 
that  non-zero  control  is  applied  until  the  fast  state  falls  onto 
the  desired  manifold. 


V.  CONCLUSIONS 

In  this  paper,  design  procedure  for  tracking  the  slow  states 
and  stabilization  of  a  general  class  of  nonlinear  singularly 
perturbed  systems  was  developed.  Based  on  the  stability 
proof  and  simulation  results  presented  in  the  paper,  the 
following  conclusions  are  drawn.  First,  the  control  laws 
computed  for  standard  singularly  perturbed  systems  using 
composite  control  [2]  are  seen  to  be  a  special  case  of  the 
proposed  technique.  It  was  also  shown  that  the  upper- 
bound  is  conservative  with  comparison  to  composite  control 
technique  as  fixed  unity  gains  were  used  in  the  formulation 
of  the  composite  Lyapunov  function.  These  gains  can  be 
chosen  optimally  as  done  in  composite  control  technique  to 
provide  a  less  conservative  upper-bound.  Second,  simulations 
for  non-standard  singularly  perturbed  systems  shows  that  for 
all  values  of  e  <  e*  asymptotic  convergence  is  guaranteed 
and  all  closed-loop  signals  remain  bounded.  The  domain 
of  convergence  of  the  proposed  technique  was  seen  to  be 
dependent  upon  the  underlying  controllers  developed  for  the 
reduced-order  systems.  It  is  possible  to  guarantee  global 
results  by  identifying  controllers  that  satisfy  Conditions 
1  —  14  for  the  complete  space  spanned  by  the  system  states. 
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Advances  in  unmanned  flight  have  led  to  the  development  of  Unmanned  Air  Systems 
that  are  capable  of  carrying  state-of-the-art  video  capturing  systems  for  the  intended  pur¬ 
pose  of  surveillance  and  tracking.  These  vehicles  have  the  capability  to  fly  through  a  target 
area  with  a  mounted  camera  and  allow  humans  to  operate  both  the  UAS  and  the  camera 
to  attempt  to  survey  any  objects  that  are  deemed  targets.  These  systems  have  worked  well 
when  controlled  by  humans,  but  having  them  operate  autonomously  to  reduce  operator 
workload  and  manpower  is  even  more  challenging  when  the  camera  is  fixed  to  the  airframe 
instead  of  being  mounted  on  a  gimbal,  so  that  the  aircraft  must  be  steered  in  order  to  steer 
the  camera.  The  presence  of  winds  must  also  be  accounted  for.  This  paper  develops  an 
algorithm  for  surveillance  of  ground  targets  by  UAS  with  fixed  pan  and  tilt  cameras,  in  the 
presence  of  winds.  This  paper  develops  an  algorithm  for  surveillance  of  ground  targets  by 
UAS.  The  specific  RL  algorithm  used  is  Q- learning,  and  the  objective  of  the  approach  is 
to  bring  any  target  located  in  an  image  captured  by  a  camera  into  the  center  of  the  image 
using  the  learned  control  policy.  The  learning  agent  determines  offline  (initially)  how  to 
control  the  UAS  and  camera  to  get  a  target  from  any  point  in  the  image  to  the  center  and 
hold  it  there.  A  feature  of  this  approach  is  that  the  learning  agent  will  continue  to  learn 
and  refine  and  update  the  previously  offline  learned  control  policy,  during  actual  operation. 
Results  presented  in  the  paper  demonstrate  that  the  approach  has  merit  for  autonomous 
surveillance  of  ground  targets. 


I.  Introduction 

One  way  to  introduce  the  concept  of  autonomy  to  the  Unmanned  Air  System  (UAS)  motion  video  tracking 
problem  is  to  determine  a  control  policy  that  is  capable  of  controlling  the  UAS  autonomously  along  a  certain 
trajectory,  while  having  the  camera  controlled  by  a  human.  Another  way  is  to  do  the  opposite,  and  have  the 
UAS  flown  manually  while  the  camera  gimbals  to  capture  and  track  identified  targets.  Both  of  these  methods 
have  been  explored  before  and  have  merit,  but  having  both  the  UAS  and  the  camera  operated  autonomously 
could  provide  greater  flight  and  tracking  efficiency.  Having  a  system  that  is  capable  of  controlling  a  UAS 
and  camera  system  to  keep  a  selected  target  visible  in  the  camera  screen  would  free  the  human  supervisor 
to  focus  on  selecting  viable  targets  and  analyzing  the  images  received. 

The  biggest  challenge  stems  from  the  need  to  determine  an  optimal  control  policy  for  keeping  the  target 
in  the  middle  of  the  image,  using  a  fixed  pan  and  tilt  camera  in  the  presence  of  winds.  Conventional 
control  techniques  require  determining  an  appropriate  cost  function  and  then  finding  the  weights  that  make 
the  control  optimal.  Although  finding  the  optimal  control  is  often  straightforward,  determining  the  cost 
function  that  best  describes  the  problem  is  not  straightforward.  For  this  research,  Reinforcement  Learning 
(RL)  is  utilized  for  the  determination  of  the  optimal  control  policy  that  will  both  gimbal  the  camera  and 
steer  the  UAS  to  provide  target  tracking. 

Reinforcement  Learning  (RL)  is  a  subset  of  machine  learning  techniques  that  have  been  implemented  in 
similar  control  policy  learning  scenarios  with  success.  It  does  not  require  the  declaration  of  a  cost  function 
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because  it  learns  the  optimal  control  policy  based  on  physical  interaction  with  the  environment  rather  than 
mathematical  approximations  of  the  problem.  RL  algorithms  break  the  problem  down  into  a  set  of  states 
and  actions  for  attempting  to  reach  a  goal,  and  receive  rewards  from  the  environment  for  meeting  those 
goals.  Given  a  particular  state  of  the  system,  feasible  actions  are  chosen,  and  the  governing  control  policy  is 
updated  based  on  what  kind  of  reward  was  given  for  the  results  of  the  action.  Over  time,  the  learning  agent 
is  able  to  converge  to  the  optimal  control  policy  for  achieving  the  desired  goal  from  each  state. 

The  specific  RL  algorithm  to  be  used  is  Q-learning1  modified  with  an  Adaptive  Action  Grid  (AAG). 
The  AAG  method  was  developed  by  Lampton  and  Valasek2,3  as  a  means  to  provide  greater  accuracy  in 
reaching  the  goal  state  (i.e.,  the  target),  while  also  decreasing  the  size  (dimensions)  of  the  state-space  to  be 
considered.  This  dramatically  decreases  the  total  number  of  states  in  the  system,  so  that  the  learning  time 
becomes  more  feasible  and  the  storage  requirements  more  tractable.  Consider  a  typical  optical  or  infrared 
(IR)  image  which  contains  background  features  plus  a  target  located  in  a  Region  of  Interest  (ROI).  It  is 
assumed  here  that  a  human  operator  would  identify  the  ROI  and  the  target  in  the  image,  although  the 
algorithm  could  conceivably  identify  the  ROI  and  target  by  itself,  based  upon  ROI  and  target  characteristics 
supplied  by  the  human  operator.  The  AAG  software  agent  then  discretizes  the  entire  state-space  (i.e.  image) 
with  course  grid  spacing,  followed  by  a  finer  discretization  of  just  the  ROI,  which  is  the  area  of  the  image  in 
the  immediate  vicinity  of  the  target.  By  not  discretizing  the  entire  state-space  with  a  fine  resolution,  there 
are  fewer  state- action  pairs  to  both  learn  and  store  in  memory.  However,  more  precision  is  needed  to  reach 
the  goal  (target),  so  multiple  levels  of  finer  discretization  are  used  in  the  ROI  as  the  learning  agent  gets 
closer  to  the  goal. 

This  paper  develops  an  algorithm  for  surveillance  of  ground  targets  by  UAS.  The  specific  RL  algorithm 
used  is  Q-learning,  and  the  objective  of  the  approach  is  to  bring  any  target  located  in  an  image  captured  by 
a  camera  into  the  center  of  the  image  using  the  learned  control  policy  described  above.  The  learning  agent 
will  determine  offline  (initially)  how  to  control  the  UAS  and  camera  to  get  a  target  from  any  point  in  the 
image  to  the  center  and  hold  it  there.  A  feature  of  this  approach  is  that  the  learning  agent  will  continue  to 
learn  and  refine  and  update  the  previously  offline  learned  control  policy  during  actual  operation. 

II.  Algorithm  Development 

Reinforcement  Learning  is  a  process  of  learning  through  interaction  in  which  a  program  uses  previous 
knowledge  of  the  results  of  its  actions  in  each  situation  to  make  an  informed  decision  when  it  later  returns  to 
the  same  situation.  It  is  a  method  that  has  been  used  for  many  diverse  problems  ranging  from  board  games 
to  behavior-based  robotics.  The  purpose  of  the  learning  agent  used  in  RL  is  to  maximize  the  long-term 
cumulative  reward,  not  just  the  immediate  reward.5  In  this  work  the  goal  is  to  remain  in  the  image,  with  a 
preference  for  being  far  from  the  edges.  The  reward  structure  must  be  set  up  to  effect  this  desire.  The  agent 
uses  the  knowledge  gained  by  reward  maximization  to  update  a  control  policy  that  is  a  function  of  the  states 
and  actions.  This  control  policy  is  essentially  a  large  matrix  that  is  composed  of  every  possible  state  for  the 
rows,  and  every  possible  action  for  the  columns.  The  three  most  commonly  used  classes  of  RL  algorithms  are 
Dynamic  Programming,  Monte  Carlo,  and  Temporal  Difference.5  The  majority  of  Dynamic  Programming 
methods  require  an  environmental  model,  making  the  use  of  them  impractical  in  problems  with  complex 
models.  Monte  Carlo  only  allows  learning  to  occur  at  the  end  of  each  episode,  causing  problems  that  have 
long  episodes  to  have  a  slow  learning  rate.  Temporal  Difference  methods  have  the  advantage  of  being  able 
to  learn  at  every  time  step  without  requiring  the  input  of  an  environmental  model.  The  most  commonly 
used  method  of  Temporal  Difference  is  known  as  Q-learning,  with  the  most  common  variation  being  Watkins 
Q-learning.6  Q-learning  is  an  on-policy  form  of  Temporal  Difference  that  utilizes  an  action-value  function 
update  rule  based  on  the  equation:5 

^t)  *  5  ^t)  T  ^[u+i  T  T  m£kx  ,  cq_|_i)  ^t)\  (I) 

a 

where  st  is  the  current  state,  at  is  the  current  action,  st+i  is  the  next  state,  at± %  is  the  next  action,  Q  is 
the  action- value  function  (used  in  the  control  policy),  and  the  k  subscript  signifies  the  current  policy.  The 
parameter  a  is  a  parameter  that  is  used  to  “penalize”  the  RL  algorithm  when  it  repeats  itself  within  each 
episode.  The  parameter  7  is  the  future  policy  discount  factor,  a  and  7  are  design  parameters  that  are  kept 
constant  for  this  work.  To  utilize  RL  for  this  problem  the  proper  representation  of  the  environment  as  the 
parameters  of  the  Watkins  Q-learning  algorithm  must  be  made.  States,  goals,  rewards,  and  actions  must 
be  designed.  The  states  of  the  RL  agent,  s,  are  defined  for  this  problem  to  be  those  states  of  the  system 
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that  either  give  information  regarding  the  target’s  position  or  those  of  the  UAS  that  can  be  controlled  for 
tracking.  This  yields  a  state-space  consisting  of  3  variables:  target  x-position  in  the  image,  target  y-position 
in  the  image,  and  UAS  bank  angle. 


s  = 


X 


Y 


(2) 


The  goal  for  an  RL  agent  is  defined  using  the  reward  structure.  The  overall  goal  is  to  have  the  target 
remain  in  the  image  frame  once  commanded,  so  a  proper  set  of  reward  requirements  must  be  constructed. 
Since  leaving  the  image  frame  is  considered  the  worst  result,  it  retains  the  worst  reward.  In  this  case,  a 
value  of  r  =  —20  is  given  for  a  target  leaving  the  image  frame.  It  is  also  desired  to  remain  away  from  the 
edges  of  the  image,  so  a  reward  of  r  =  —  5  is  given  for  hitting  the  image  boundary,  while  a  positive  reward 
of  r  —  +20  is  awarded  for  reaching  the  center  of  the  image.  This  encourages  the  RL  agent  to  move  the  UAS 
such  that  the  target  stays  as  far  from  any  edge  of  the  image  frame  as  is  possible.  All  other  possible  states 
yield  a  neutral  reward  of  r  =  0.  Since  the  goal  is  generally  defined  for  the  RL  agent  as  the  state  that  yields 
the  highest  reward,  the  goal  is  defined  as  all  states  where  the  target  image  position  is  at  the  center. 


9  = 


0  0 


(3) 


The  agent  for  this  problem  is  limited  in  its  control  of  the  states  because  the  target  global  position  is 
independent  of  any  action  the  UAS  takes.  The  only  part  of  the  environment  that  the  UAS  can  control  is 
itself.  Based  on  the  assumptions  for  this  problem,  the  only  way  the  UAS  can  control  the  position  of  the 
target  in  the  image  frame  is  to  change  its  bank  angle.  Therefore,  the  action-space  for  this  problem  is  defined 
to  be  increments  in  the  UAS  bank  angle,  where  for  this  problem  Ac/)  =  2  degrees. 


-2  0  2 

This  formulation  of  the  RL  problem  is  therefore 

I  =  \Xi  Yi  4 


iT 


=  Acj)  0A  (j)  +A0 
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(4) 

(5) 

(6) 
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III.  Simulation  Results 


III. A.  Disturbance  Free 

All  learning  takes  place  offline,  and  after  a  sufficient  number  of  learning  episodes  have  been  completed  the 
control  policy  performance  and  robustness  is  evaluated  via  Monte  Carlo  simulation.  Test  cases  consisting  of 
a  fixed  target  tracking  scenario  and  a  moving  target  tracking  scenario  are  presented.  For  each  case,  target 
position  in  the  image  frame  and  time  histories  of  the  UAS  states  are  displayed.  Monte  Carlo  simulations  are 
presented  for  a  chosen  timespan  of  100  seconds.  This  represents  the  typical  amount  of  time  for  the  controller 
to  position  the  aircraft  in  a  stable  tracking  configuration.  The  RL  agent  was  allowed  to  run  for  1,000,000 
episodes,  with  Monte  Carlo  snapshots  taken  at  a  few  places  beginning  at  500,000  episodes.  The  Monte  Carlo 
randomization  places  the  initial  position  of  the  target  in  one  of  the  four  quadrants  of  the  image  frame,  and  at 
a  random  position  within  each  quadrant.  The  controller  must  then  steer  the  UAS  so  that  ideally,  the  target 
is  driven  to  the  center  of  the  image  frame.  One  representative  case  is  provided  for  each  of  the  four  quadrants. 
Images  positions  are  given  in  pixels  and  aircraft  bank  angles  are  given  in  degrees.  Aircraft  inertial  positions 
are  in  meters. 

The  results  are  taken  from  one  single  test  case  of  the  Monte  Carlo  runs,  in  which  the  target  initial 
conditions  place  it  in  the  image  frame  in  quadrant  1.  Figure  1  shows  a  3-dimensional  view  of  the  aircraft 
moving  in  inertial  airspace  tracking  the  target.  As  can  be  seen  in  Figure  1,  the  aircraft  approaches  a  circular 
orbit  to  track  the  stationary  target  on  the  ground.  In  Figure  2,  the  position  of  the  target  at  each  timestep 
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is  displayed.  The  target  begins  in  quadrant  1  of  the  image  and  moves  toward  the  goal  of  the  center  of  the 
image.  However,  it  is  unable  to  remain  there  given  the  aircraft’s  current  state  and  it  is  lost  from  the  center. 
As  the  aircraft  banks  left,  the  target  moves  up  in  the  image  frame,  while  a  right  bank  moves  the  target 
down.  The  forward  motion  of  the  aircraft  causes  difficulty  in  tracking  the  target  in  the  x-direction,  and  this 
is  reflected  in  the  target  almost  being  lost  off  the  left  side  of  the  image.  The  reinforcement  that  the  agent 
has  received  leads  it  to  settle  in  a  state  of  keeping  the  target  in  quadrant  2.  Time  histories  of  the  target 
position  in  the  image  frame,  aircraft  bank  angle,  and  commanded  change  in  aircraft  bank  angle  are  shown 
in  Figure  3.  It  is  seen  that  the  controller  keeps  the  target  in  the  image  frame  throughout  the  simulation 
timespan. 
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Figure  3.  State  Time  History:  Stationary  Target 


With  a  successful  orbit  of  a  moving  target  learned,  the  RL  agent  was  then  presented  the  task  of  learning 
to  follow  a  target  that  moves.  This  target  moves  in  a  straight  line  at  60  mph,  the  same  speed  as  the  cruise  of 
the  aircraft.  Under  this  condition,  the  aircraft  attempts  to  follow  alongside  the  target  as  it  travels  forward. 
In  Figure  4,  it  can  be  seen  that  the  aircraft  begins  following  alongside  the  target  well,  and  begins  to  deviate 
away  from  it  as  time  moves  forward.  This  is  due  to  the  stationary  camera  requiring  tracking  to  be  handled  by 
banking  the  aircraft.  It  can  be  seen  in  Figure  5  that  the  target  is  maintained  in  the  image  frame  throughout 
the  duration  of  this  simulation.  Like  the  stationary  case,  the  target  passes  the  prescribed  goal  of  centering 
in  the  image  and  settles  in  quadrant  2.  The  time  histories  shown  in  Figure  6  reveal  that  to  maintain  this 
tracking  requires  frequent  bank  angle  commands,  unlike  the  stationary  case. 
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Figure  5.  Image  Time  History:  Moving  Target 


III.B.  Wind 

Accounting  for  wind  in  this  problem  has  been  done  using  a  variety  of  methods,7,8  and  accounting  for  wind 
disturbances  using  the  present  method  requires  altering  the  learning  process  in  the  state-space  to  handle 
the  additional  state  information.  Wind  can  be  handled  by  the  learning  agent,  but  it  requires  knowledge 
of  the  wind  speed  and  direction.  This  modification  can  be  done  by  adding  two  new  states  to  the  learning 
state-space.  This  modified  state-space  is  shown  in  Equation  8,  where  vw  and  i/jw  are  the  wind  speed  and 
wind  heading  angle,  respectively. 
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S  = 


T 


(8) 


X  Y  (j)  vw  i>w 

With  wind  added  to  the  simulation,  new  learning  was  experienced  with  random  wind  speeds  and  directions 
initialized  at  the  beginning  of  each  episode.  After  1,000,000  learning  episodes,  Monte  Carlo  results  were 
used  with  the  learned  Q-matrix.  The  following  figures  show  an  example  from  these  Monte  Carlo  results  for 
a  stationary  target  with  wind  disturbances.  The  wind  vector  for  this  particular  simulation  is  13  mph  at  a 
heading  angle  of  45  degrees.  As  can  be  seen  in  Figure  7,  the  aircraft  approaches  the  circular  orbit  as  from 
before,  but  due  to  the  wind  disturbance  it  is  not  nearly  as  smooth  of  a  circular  orbit  as  in  Figure  1.  Figure  8 
shows  that  throughout  the  duration  of  the  simulation,  the  target  does  remain  in  the  image.  By  comparing 
Figure  9  to  Figure  3,  it  can  be  seen  that  many  more  bank  angle  commands  are  required  to  maintain  tracking 
when  there  is  wind,  as  is  expected. 


IV.  Conclusions  and  Future  Work 

Based  on  the  results  presented  in  this  paper,  it  is  concluded  that: 

1.  Algorithm  learning  convergence  for  the  static  target  case  with  fixed  camera  is  rapid,  due  to  the  low 
number  of  state-action  pairs.  The  number  of  learning  episodes  required  to  converge  to  a  “good” 
solution  varies  with  the  particular  case/scenario  being  learned,  but  all  results  have  shown  a  clear  point 
of  diminishing  returns  about  which  running  additional  learning  scenarios  provides  only  a  marginal 
improvement  in  performance. 

2.  For  all  of  the  stationary  target  cases  evaluated,  the  RL  controller  keeps  the  target  in  the  image  frame 
throughout  the  simulation  timespan.  Although  the  target  does  not  stay  in  the  reinforcement  learning 
positive  goal  area  (the  origin),  the  broader  tracking  goal  of  keeping  the  target  in  the  image  frame  itself 
for  a  useful  period  of  time  is  generally  met. 

3.  Camera  installation  and  orientation,  and  the  initial  position  of  the  target  relative  to  the  initial  position 
of  the  aircraft  have  a  strong  influence  on  the  results.  In  each  example,  the  controller  attempts  to  drive 
the  target  into  the  second  quadrant.  This  is  due  to  the  geometry  of  the  scenario  and  the  location  of  the 
camera  in  the  aircraft.  Having  the  target  in  the  second  quadrant  allows  the  aircraft  to  turn  ahead  of 
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Figure  8.  Image  Time  History:  Stationary  Target  with  Wind 


the  target,  in  order  to  keep  it  in  the  image  frame  in  the  future.  Consequently,  if  the  camera  is  oriented 
to  point  out  the  right  side  of  the  aircraft,  the  aircraft  is  steered  to  keep  the  target  in  the  first  quadrant. 

4.  Preliminary  results  for  the  case  of  a  moving  target  show  that  this  method  has  promise  for  learning  to 
track  a  moving  target,  and  merits  further  investigation. 

5.  Preliminary  results  for  wind  results  shows  that  using  wind  measurements  in  the  state-space  is  a  promis¬ 
ing  method  for  accounting  for  wind  in  the  learning  process,  and  merits  further  investigation  alongside 
the  moving  target  scenarios. 
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This  research  will  be  expanded  in  future  work  in  several  ways.  One  expansion  that  is  to  be  explored 
is  the  use  of  a  gimbaled  camera  rather  than  a  fixed-base  camera  mounted  on  the  UAS.  This  will  allow  for 
greater  ease  of  tracking,  but  will  require  a  reimagining  of  the  Reinforcement  Learning  problem.  Another 
extension  that  will  be  explored  is  the  introduction  of  more  action  choices  by  varying  the  UAS  altitude  and/or 
cruise  speed.  This  will  allow  for  learning  to  follow  a  moving  target  that  is  traveling  at  various  speeds  and 
along  a  variety  of  trajectories,  but  will  greatly  increase  the  learning  state-action  pairs.  For  each  expansion 
explored,  the  inclusion  of  wind  considerations  as  done  in  this  paper  will  allow  for  accurate  appraisal  of  the 
UAS  ability  to  track  ground  targets  in  actuality.  With  learned  control  policies  for  each  research  scenario, 
the  final  research  to  be  conducted  in  this  line  will  be  to  evaluate  each  policy  through  flight  testing  on  an 
actual  UAS. 
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Requirments  for  current  and  future  Unmanned  Air  Vehicles  with  longer  flight  en¬ 
durances  have  led  to  a  need  for  an  autonomous  soaring  capability.  This  paper  investigates 
aircraft  flight  path  guidance  for  search  and  localization  of  Regions  of  Interest,  consisting 
of  atmospheric  phenomena.  The  problem  is  posed  as  an  offline  agent  learning  problem, 
of  localizing  atmospheric  thermal  locations  and  then  guiding  an  Unmanned  Air  Vehicle  to 
soar  from  one  to  another.  Q-  learning  is  used  as  the  learning  algorithm.  The  computational 
navigation  solution  used  here  is  a  basic  grid  algorithm  that  assigns  thermal  locations  and 
intensities,  with  the  representation  being  specified  states,  actions,  goals,  and  rewards  that 
are  used  to  accomplish  the  agent  learning.  The  approach  is  validated  with  a  simulation  of  a 
powered  Unmanned  Air  Vehicle.  Results  presented  in  the  paper  show  that  the  autonomous 
agent  can  learn  how  to  navigate  to  a  thermal  quickly  and  efficiently  by  controlling  bank 
angle,  while  simultaneously  monitoring  its  inertial  position  and  heading  angle. 


I.  Introduction 

IN  this  paper,  search  and  localization  guidance  will  initially  be  applied  to  a  powered  UAV  but  will  be 
eventually  used  for  autonomous  soaring  of  UAVs,  a  task  that  can  theoretically  permit  them  to  stay 
airborne  for  many  hours  or  even  several  days  at  a  time.  Thermal  updrafts  and  their  intensities  can  be 
located  and  tracked  to  allow  a  glider  UAV  with  a  nominal  endurance  of  two  hours  to  soar  through  the  air 
for  a  maximum  of  14  hours.  Motion  in  a  fixed-dimension  spiral  path  can  be  used  to  explore  the  atmosphere 
for  thermals,  while  the  aircraft  simultaneously  keeps  track  of  the  updraft’s  target  position.  An  aircraft 
can  easily  make  use  of  a  thermal  gust’s  energy  to  permit  higher  wing  loadings  and  smaller  battery  size, 
which  consequently  facilitates  larger  payload  and  increased  cruise  speed.  Search  and  localization  guidance 
using  thermal  updrafts  can  additionally  be  accomplished  by  centering  a  vehicle  about  a  thermal,  or  using 
inter-thermal  gusts.  Reward  signals  can  be  cast  in  terms  of  net  lift,  while  the  states  are  a  thermal  locator’s 
estimate  of  thermal  size  and  strength.  An  optimum  trajectory  generation  algorithm  that  promotes  greater 
autonomy  of  an  aircraft  can  improve  flight  range  and  endurance  promoted  by  a  thermal  gust. 

Thermal  updrafts’  locations  and  intensities  can  be  mapped  using  on-board  sensors.  Specifically,  infrared 
cameras  can  be  used  to  locate  thermals,  and  real  time  trajectories  for  dynamic  soaring  applications  can  be 
produced. 1  Furthermore,  memory  components  can  map  available  energy  based  on  previous  sensor  readings 
and  a  history  of  learning.  Other  research  presents  maintaining  a  wind  map  based  on  data  collected  during 
flight,  as  well  as  using  currently  available  maps  to  generate  energy-gain  paths.  A  path  generator  can  then 
plan  paths  based  on  energy  efficiency  and  field  exploration. 

Once  thermal  updrafts  are  located,  the  amount  of  vertical  velocity  and  the  drifting  motion  of  the  center 
of  the  thermal  can  be  considered.  Thermal  centering  controllers  can  be  used  to  represent  the  thermal’s 
location  and  size.  Maximum  climb  rate  can  be  achieved  at  the  center  of  the  updraft;  bank  angle  can  be 
changed  based  on  the  climb  rate  to  move  the  aircraft’s  circular  path  to  coincide  with  the  center  of  the 
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updraft.'  A  UAV  can  track  a  moving,  non-predefined  target,  such  as  a  thermal,  by  changing  its  turn  rate 
and  speed.  Tracking  a  moving  thermal  can  be  improved  by  examining  wind  disturbance  rejection  controls.9 

This  paper  develops  a  learning  agent  for  guiding  a  powered  UAV  from  one  thermal  location  to  another 
by  considering  the  control  of  the  UAV  to  be  a  Reinforcement  Learning  problem.  This  is  done  as  a  precursor 
to  autonomous  soaring  using  unpowered  UAVs.  To  initiate  the  Reinforcement  Learning  process,  one  simply 
needs  to  define  a  set  of  states  and  actions  that  are  used  in  a  specific  type  of  Reinforcement  Learning 
algorithm,  known  as  Q-learning.  The  Q-learning  algorithm  involves  updating  a  Q-matrix  that  contains  the 
states,  actions,  and  the  value  of  a  certain  state-action  pair.  The  states  for  this  problem  consist  of  bank  angle, 
heading  angle,  and  position  in  the  global  coordinate  system.  Upon  defining  a  goal,  without  supervision  the 
agent  determines  the  thermals’  intensities  and  ranks  them  in  order  of  highest  intensity  and  smallest  distance 
from  the  aircraft.  If  there  is  only  one  thermal,  that  updraft  becomes  the  goal.  The  aircraft  then  tracks  the 
thermal  and  navigates  to  it  using  the  most  efficient  route.  Rewards  are  based  on  the  global  position  of  the 
aircraft  and  thermals. 


II.  Q-Learning 

Reinforcement  Learning  is  a  machine  learning  technique  that  uses  reinforcing  rewards  from  the  environ¬ 
ment  to  improve  a  policy  that  determines  the  best  states  of  a  system  or  the  best  actions  to  take  given  the 
current  state  of  the  system.  The  most  widely  used  Reinforcement  Learning  technique  is  known  as  Watkins’ 
Q-learning.  Q-learning  is  a  temporal-difference  method  that  uses  the  current  estimate  of  the  action-value 
function  to  determine  how  to  maximize  rewards  from  the  environment  in  an  off-policy  manner.  It  is  called 
off-policy  because  the  policy  being  used  for  decision  making  during  an  episode  is  not  necessarily  the  same 
policy  that  is  being  updated  with  rewards. 

While  the  policy  being  updated  is  typically  a  greedy  policy  utilizing  the  action- value  function,  the  policy 
used  during  an  episode  for  choosing  actions  can  be  exploratory.  An  agent  is  given  a  specific  goal,  and  through 
interaction  with  its  environment  it  will  either  explore  to  discover  better  actions  for  the  future,  or  exploit  the 
knowledge  it  has  already  acquired  to  guarantee  an  increase  in  positive  reward.  An  e-greedy  policy  uses  a 
probability,  e,  to  determine  whether  an  action  will  be  exploration  or  exploitation. 

A  learned  action- value  function,  Q,  is  often  referred  to  as  the  Q-matrix.  The  Q-matrix  contains  nearly  all 
possible  combinations  of  state-action  pairs  once  learning  is  complete.  Therefore,  the  more  states  or  the  more 
actions  that  are  available  to  an  agent,  the  bigger  the  Q-matrix  becomes.  In  Q-learning,  the  final  Q-matrix 
will  be  optimal  if  every  state- action  pair  is  visited  infinitely  many  times.  Since  this  is  not  possible,  learning 
is  typically  ended  once  the  performance  is  deemed  “good  enough”.  As  the  Q-matrix  becomes  large,  more 
learning  episodes  are  required  for  the  agent  to  appropriately  experience  every  state-action  pair.  Along  with 
specific  state-action  pairs  are  corresponding  values  in  the  Q-matrix  based  on  the  amount  and  magnitude  of 
rewards  given.  The  values  contribute  to  the  policy  function.  A  higher  value  listed  in  the  Q-matrix  for  a 
state-action  pair  signifies  a  high  benefit  of  existing  in  that  state  and  taking  that  action.  However,  in  order 
to  update  or  add  values  to  the  Q-matrix,  an  agent  needs  to  explore.  If  an  agent  explores,  it  acts  without 
knowing  if  it  will  receive  a  positive,  negative,  or  zero  reward.  However,  without  exploring,  an  agent  could 
likely  never  learn  the  most  efficient  route  to  a  goal.  A  balance  between  exploring  and  exploiting  knowledge 
is  required  in  the  Reinforcement  Learning  process.  It  is  for  this  reason  that  an  e-greedy  policy  is  used.  The 
specific  update  rule  that  is  used  for  evolving  the  Q-matrix  is  shown  in  Equation  1. 

Qk+i(s,  a)  Qk(s,a)  +  a[rk+1  +  7111a xQk(s',a')  -  Qk(s,a )]  (1) 

a' 

In  Equation  1,  a  is  the  learning  rate,  7  is  the  discount  factor,  k  indicates  the  current  timestep,  r  is  the 
reward,  s  is  the  state,  a  is  the  action,  s'  is  the  next  state,  and  a'  is  the  next  action.  Both  a  and  7  are  always 
between  0  and  1.  The  learning  rate  establishes  the  importance  of  new  information  compared  to  previously 
obtained  information.  If  the  value  of  a  is  0,  the  agent  only  considers  the  old  information  and  does  not  learn 
anything  new;  if  the  value  is  1,  the  agent  simply  considers  the  new  information  without  regard  to  previously 
gained  information.  Furthermore,  7  measures  the  importance  of  knowing  the  value  of  future  rewards  when 
choosing  an  action.  If  7  is  near  0,  then  the  agent  will  primarily  consider  immediate  rewards,  while  a  value 
approximately  equal  to  1  will  cause  the  agent  to  focus  mainly  on  future  rewards.  This  update  rule  is  used 
in  Watkins’  Q-learning  as  shown  in  the  full  algorithm  below. 


2  of  8 


American  Institute  of  Aeronautics  and  Astronautics 


Watkins’  Q-learning  Algorithm : 

Initialize  Q(s,  a)  arbitrarily 
Repeat  (for  each  episode)  : 

Initialize  s 

Repeat  (for  each  step  of  episode)  : 

Choose  a  from  s  using  policy  derived  from  Q(e.g.,  e— greedy) 
Take  action  a,  observe  r,  sf 

Q{s,a )  <-  Q(s,  a)  +  a[r  +  7maxa/Q(s/,  a/)  -  Q(s,  a)\ 

S  i —  5/5 

until  s  is  terminal 


III.  Representations 

For  the  search  and  localization  guidance  task  involving  the  use  of  Regions  of  Interest  consisting  of  thermal 
updrafts,  close  consideration  to  states,  actions,  goals,  rewards,  and  other  parameters  is  important.  This 
particular  experiment  involves  a  mathematical  simulation  model  of  a  UAV  moving  in  two  dimensions.  States 
are  chosen  to  be  bank  angle,  heading  angle,  and  X  and  Y  position  in  the  global  coordinate  system.  The 
three  actions  that  are  available  to  the  vehicle  include  a  five  degree  increase  or  decrease  in  bank  angle,  or  no 
change  in  bank  angle. 

For  this  simulation,  the  starting  state  includes  arbitrary  values  of  bank  angle,  heading  angle,  and  X  and 
Y  global  positions.  The  available  actions  are  presented  in  increments  of  five  degrees  change  in  bank  angle 
to  permit  a  continuous  and  smooth  incremental  learning  process.  Four  states  are  included  in  the  method 
because  this  is  a  minimum  amount  of  states  needed  for  this  agent  to  still  adequately  learn  how  to  reach 
a  goal.  If  more  states  were  used,  there  would  consequently  be  more  state-action  pair  combinations.  This 
would  require  more  learning  episodes  to  be  completed  to  allow  the  agent  more  opportunities  to  sample  all 
possible  state-action  combinations.  It  is  therefore  important  to  keep  the  amount  of  states  and  actions  to  a 
minimum. 

The  goals  for  this  experiment  are  based  on  the  thermal  updrafts.  Specifically,  they  are  the  magnitudes 
of  the  velocity  in  the  center  of  the  thermal.  The  values  of  the  velocity  are  assumed  to  be  known  prior 
to  the  learning.  An  agent  will  notice  if  it  has  reached  a  goal  if  the  X  and  Y  coordinates  of  the  state  in 
which  it  is  currently  located  is  within  a  certain  pre-defined  range  of  the  X  and  Y  global  positions  of  the 
center  of  the  thermal.  Rewards  are  subsequently  based  on  the  position  of  the  aircraft.  A  reward  of  +20 
is  given  if  the  agent  reaches  a  goal,  and  a  negative  reward  of  the  same  magnitude  is  rendered  if  the  agent 
reaches  the  boundary  of  the  grid.  Any  other  action  that  does  not  immediately  place  the  agent  in  the  goal 
range  or  a  position  on  the  boundary  receives  a  reward  of  zero.  Magnitudes  of  the  rewards  are  somewhat 
subjective.  A  reward  with  a  very  large  magnitude  can  lead  to  values  with  large  magnitudes  in  the  Q-matrix. 
However,  these  values  will  still  principally  be  proportional  to  values  in  the  Q-matrix  calculated  based  on 
rewards  of  smaller  magnitudes.  Rewards  of  less  than  unity  would  obviously  cause  values  in  the  Q-matrix  to 
be  unnecessarily  minuscule.  The  magnitude  of  the  rewards  has  no  major  effect  on  the  learning  process. 

IV.  Air  Vehicle  Modeling 

The  UAV  model  in  the  simulations  is  Pegasus.  It  has  a  wingspan  of  144  in,  and  its  fuselage  is  75  in  long 
and  about  15  in  wide.  Its  average  cruise  speed  is  in  the  range  of  30-50  m/s.  A  model  of  Pegasus  is  shown 
in  Figure  1.  The  equations  of  motion  used  in  the  simulation  are  for  constant  altitude  turning  flight. 
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Figure  1:  Photograph  of  Pegasus  UAV 


V.  Agent  Learning 

Learning  begins  with  a  declaration  of  a  state  that  the  user  chooses  or  is  chosen  randomly  by  the  computer. 
An  action  matrix  is  declared,  and  includes  as  inputs  the  values  that  can  be  added  to  the  bank  angle.  Variables 
representing  the  size  of  the  action  matrix ,  and  the  number  of  states  are  also  created  for  easy  reference 
throughout  the  program.  The  Q-matrix  is  formed  at  this  time  as  an  empty  matrix  that  can  be  expanded  as 
values  are  added  to  it.  It  has  the  same  amount  of  columns  as  there  are  actions.  In  this  instance,  it  begins  as 
three  cells  because  there  are  three  actions  available  for  the  agent.  Inside  each  of  these  three  cells  is  another 
matrix  with  number  of  columns  equal  to  the  amount  of  states  plus  one  additional  column  to  contain  the 
values  of  the  corresponding  state- action  pair.  Because  the  agent  is  allowed  to  explore  at  the  very  beginning 
of  the  learning  process,  it  chooses  any  of  the  three  actions  available  at  random.  A  new  function,  nextstate  is 
called.  Its  inputs  include  the  state  in  which  the  agent  existed  prior  to  its  move,  the  action  that  it  chose  to 
take,  the  action  matrix ,  time  step,  cruise  speed,  gravitational  constant,  and  the  number  of  states. 

Since  the  bank  angle,  heading  angle,  and  X  and  Y  position  that  comprise  the  new  state  will  exist  to 
infinite  decimal  places,  their  values  are  rounded  to  the  closest  four  decimal  places  at  the  end  of  the  nextstate 
function. 

Next,  another  function  is  called  to  determine  the  reward  that  the  agent  deserves  after  a  certain  move. 
As  previously  mentioned,  the  agent  will  receive  a  positive  reward  for  reaching  the  goal  range,  a  negative 
reward  for  reaching  a  boundary  point,  or  a  reward  of  zero  for  any  other  move  that  does  not  immediately 
lead  the  agent  to  the  goal.  In  order  to  prevent  the  agent  from  more  easily  maximizing  its  total  reward  by 
changing  its  bank  and  heading  angle  as  opposed  to  its  position,  the  reward  function  only  gives  rewards  based 
on  X  and  Y  position.  The  Q-matrix  is  updated  in  a  proceeding  function.  The  reward  value  is  placed  in  a 
cell  of  the  Q-matrix  based  on  the  action  taken  and  the  state  that  the  agent  was  in  before  the  action.  The 
agent’s  current  state  now  becomes  the  new  starting  state  and  a  new  cycle  begins.  When  the  agent  reaches 
the  predefined  goal,  the  episode  ends  and  a  new  episode  initiates. 

At  this  point,  an  agent  can  either  take  actions  randomly  or  greedily.  A  greedy  action  involves  an  agent 
choosing  an  action  that  it  already  knows  will  lead  to  a  high  reward.  Before  any  action  is  chosen,  however, 
the  egreedy  function  is  called.  It  takes  as  inputs  the  current  state,  the  Q-matrix,  the  number  of  actions,  and 
the  current  value  of  e.  The  function  outputs  the  act  ion- value  of  the  current  state  and  selected  action,  as  well 
as  the  best  action  that  the  agent  should  take  if  it  is  going  to  exploit  its  knowledge.  Otherwise,  the  egreedy 
function  will  choose  a  random  action  for  the  agent  to  take. 

As  new  states  are  being  determined  based  on  the  chosen  action,  the  heading  angle,  bank  angle,  and  X 
and  Y  position  values  will  not  be  whole  numbers.  Instead  of  rounding  the  X  and  Y  coordinates  and  heading 
angle  immediately  in  the  learning  process,  these  values  remain  continuous  during  the  process  of  calculating 
the  X  and  Y  coordinates  and  bank  and  heading  angle  for  the  next  state.  However,  at  the  time  that  the  values 
in  the  Q-matrix  need  to  be  updated,  the  X  and  Y  coordinates  and  heading  angle  are  rounded  according  to 
the  discretization  matrix.  This  allows  exact  states  in  the  Q-matrix  to  be  found  and  updated.  Heading  and 
bank  angle  values  are  rounded  to  the  closest  number  evenly  divisible  by  five.  On  the  other  hand,  X  and 
Y  coordinate  position  values  are  rounded  to  numbers  evenly  divisible  by  50.  This  rounding  is  important 
because  an  agent  is  required  to  move  from  one  grid  point  to  another.  The  grid  is  separated  in  increments 
of  50  m  in  both  the  abscissa  and  ordinate  axis.  At  this  point,  the  k-nearest  neighbor  algorithm  is  used  to 
transfer  the  agent  to  the  closest  point  on  the  grid.  The  next  state  is  then  calculated  using  the  actual  values  of 
X  and  Y  coordinates  and  heading  angle.  Importantly,  the  bank  angle  is  kept  as  a  value  divisible  by  5  during 
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the  learning  process  and  the  Q-matrix  update  process.  Consequently,  the  agent  behaves  more  realistically; 
the  X  and  Y  coordinates  and  heading  angle  are  rounded  only  in  the  process  of  updating  the  Q-matrix  and 
determining  the  next  best  action. 

An  important  step  in  this  process  is  saving  the  Q-matrix  at  various  time  increments.  This  enables 
additional  learning  for  the  agent  that  can  ensure  a  more  precise  Q-matrix.  In  this  example,  the  Q-matrix 
is  saved  every  200  episodes.  Furthermore,  the  code  allows  the  user  to  begin  the  Q-learning  process  and 
terminate  it  whenever  is  necessary.  Later,  the  user  can  choose  to  continue  the  learning  where  it  previously 
left  off  and  upload  the  already  partially  learned  Q-matrix.  Consequently,  the  previously  learned  Q-matrix 
can  be  improved.  As  each  new  simulation  is  completed,  files  named  according  to  the  current  date  are  saved 
with  the  values  of  all  parameters  and  of  course  the  Q-matrix.  This  organizes  each  process  and  expedites 
retrieval  of  previous  files  when  needed. 

The  Monte  Carlo  method  is  finally  used  to  determine  the  amount  of  times  that  the  agent  reached  the 
goal.  The  Monte  Carlo  process  completed  here  also  noted  the  path  of  the  agent,  the  action  it  took  at  each 
state,  and  the  amount  of  positive  and  negative  rewards  given.  These  values  are  stored  in  separate  matrices. 
Based  on  this  data,  one  can  determine  the  effectiveness  of  the  learning  process.  Additionally,  because  the 
Q-learning  process  ends  as  soon  as  the  agent  discovers  the  goal,  it  is  necessary  to  end  the  cycle  in  the  Monte 
Carlo  episodes  once  the  agent  found  the  goal.  In  reality,  an  aircraft  does  not  stop  moving  once  it  reaches  a 
goal.  However,  to  ensure  the  proper  performance  of  the  learning  process,  it  is  sufficient  to  code  a  break  at 
the  time  that  the  agent  reaches  the  goal. 


VI.  Simulation 


The  objective  of  the  simulation  described  in  this  paper  is  to  show  how  Q-learning  affects  the  way  an  agent 
navigates  through  its  environment.  It  is  desirable  that  this  learning  method  be  precise  and  time  efficient. 
The  mathematical  simulation  of  the  UAV  was  executed  using  the  commercial  Matlab  software  program. 

The  simulation  involves  an  agent  beginning  the  Q-learning  process  along  the  boundary  of  the  grid.  Initial 
X  and  Y  boundary  coordinate  positions  are  random,  but  the  large  magnitude  of  episodes  required  promotes 
initial  experience  at  every  boundary  point.  The  agent  starts  at  every  grid  point  on  the  boundary  including 
the  corners  of  the  grid.  Furthermore,  initial  heading  and  bank  angle  of  the  aircraft  is  properly  defined  for 
this  simulation.  Specifically,  if  the  agent  begins  at  a  boundary  point  that  is  along  the  X  axis,  its  initial  state 
consists  a  0  degree  heading  angle.  If  the  agent  begins  along  the  Y  axis,  its  initial  heading  angle  is  90  degrees. 
For  every  starting  boundary  condition,  the  agent  begins  with  a  0  degree  bank  angle  and  a  heading  angle 
that  allows  the  agent  to  be  pointing  directly  to  the  center  of  the  grid.  Consequently,  an  agent  starting  at 
a  corner  grid  point  possesses  a  45,  -45,  135,  or  -135  degree  heading  angle  depending  on  the  location  of  the 
corner.  These  values  are  determined  based  on  the  parameter  constraints  chosen  by  the  user. 

When  the  action  taken  to  affect  the  bank  angle  is  known,  the  subsequent  state  can  be  determined  using 
Equations  2,  3,  4,  and  5  relating  aircraft  dynamics: 


A'l/j  =  ( timestep )ip 
dx  =  (timestep)ucos(A'ip) 
dy  =  (timestep)u  sin(A'0), 


(2) 

(3) 

(4) 

(5) 


where  ip  is  the  heading  angle,  (p  is  the  bank  angle,  g  is  the  gravitational  constant,  and  u  is  the  vehicle’s 
cruise  speed.  These  equations  are  used  because  only  constant  cruise  speed  and  constant  radius  turns  are 
considered.  The  agent  is  assumed  to  move  at  a  constant  cruise  speed  of  35  m/s,  the  time  step  is  established 
as  2  s,  and  the  acceleration  of  gravity  used  is  9.8  m/s 2 . 

Before  the  Q-learning  process  begins,  a  thermal  location  and  intensity  are  randomly  generated  using  a 
script  that  involves  running  a  check  case  of  a  NASA  DFRC  updraft  model.  It  is  displayed  on  a  grid  that  is 
a  1000  m  square  in  the  X  and  Y  coordinate  system.  It  should  be  noted  that  these  dimensions  can  be  altered 
to  allow  for  a  greater  learning  space.  However,  for  initial  learning,  it  is  significant  to  begin  the  process  on 
a  grid  this  size.  The  global  position  of  the  maximum  intensity  in  the  center  of  the  updraft  is  known  and 
presented  in  the  updraft  model,  and  the  velocity  at  this  coordinate  is  stored  as  the  goal  for  the  learning 
process.  The  user  declares  a  range  of  velocities  that  acts  as  an  invisible  radius  around  the  thermal.  The 


5  of  8 


American  Institute  of  Aeronautics  and  Astronautics 


agent  will  attempt  to  at  least  reach  the  thermal  somewhere  in  its  range  to  receive  a  positive  reward.  In 
this  case,  the  maximum  of  the  range  is  4  m/s  and  the  minimum  is  slightly  greater  than  0  m/s.  Specifically, 
the  minimum  range  value  will  always  be  95  percent  of  the  maximum  intensity  velocity.  This  ensures  that 
positive  reward  values  are  not  given  to  the  agent  when  it  is  in  an  area  with  a  zero  velocity  thermal  gust. 

The  program  is  initially  set  to  run  about  150,000  episodes.  Within  each  episode,  the  agent  is  allowed 
500  cycles,  or  opportunities,  to  take  an  action.  The  parameters  are  simply  a  basic  starting  point  to  observe 
the  learning  process.  Not  all  cycles  will  always  be  used  within  each  episode  however.  If  the  agent  reaches 
a  boundary  point,  for  example,  the  cycle  will  end  and  a  new  episode  will  begin.  Other  constants  are  also 
initialized  before  the  learning  occurs.  The  learning  rate  is  set  to  0.1  and  the  discount  factor  takes  a  value  of 
0.7.  The  value  of  e,  the  parameter  utilized  in  the  e-greedy  method  that  promotes  random  action,  is  initially 
unity.  This  means  that  the  agent  will  begin  the  process  solely  exploring  and  choosing  actions  at  random. 
After  every  5  percent  of  episodes  are  completed,  e  will  decrease  by  0.05  to  ensure  exploitation.  However, 
this  value  is  not  permitted  to  reach  a  number  less  than  0.05  because  it  is  not  necessary  to  completely  ignore 
the  exploring  process. 

Certain  constraints  are  required  for  several  parameters.  For  the  two-dimensional  learning  process  de¬ 
scribed  in  this  paper,  both  X  and  Y  coordinates  are  restricted  between  0  and  1000  m  to  prevent  the  aircraft 
to  enter  negative  coordinate  positions  outside  of  the  grid.  The  heading  angle  is  permitted  to  exist  between 
-180  and  180  degrees,  while  bank  angle  is  required  to  be  between  -45  and  45  degrees.  These  values  can 
potentially  be  altered,  but  they  are  reasonable  and  effective  for  this  specific  simulation. 


VII.  Numerical  Examples 


The  first  simulation  completed  150,000  episodes.  One  thermal  was  placed  at  the  coordinates  (732.0689, 
702.2730)  which  rounds  to  the  grid  point  (750,  700).  One  trajectory  chosen  by  the  agent  is  shown  below 
in  Figure  2.  The  agent  began  learning  at  the  grid  point  (0,350)  with  a  heading  angle  of  90  degrees  and 
a  bank  angle  of  0  degrees.  Figure  3  shows  the  agent’s  time  history.  The  top  two  graphs  present  time 
versus  coordinate  position.  The  dotted  red  line  represents  the  respective  coordinate  of  the  thermal  and  the 
dotted  black  lines  above  and  below  represent  the  range  of  the  thermal.  The  bottom  two  graphs  present  the 
relationship  between  time  and  change  in  bank  and  heading  angles  for  this  specific  trajectory. 


Figure  2:  Trajectory  of  the  UAV 


Figure  3:  Time  History  of  the  Simulation 


The  next  simulation  completed  approximately  130,000  episodes.  Three  thermals  were  placed  at  the 
coordinates  (181.0450,  815.9635),  (384.2516,  269.0323),  and  (595.0047,  193.0734)  which  round  respectively 
to  the  grid  points  (200,  800),  (400,  250),  and  (600,  200).  Each  thermal  had  a  different  range  of  intensity. 
This  simulation  involved  requiring  the  agent  to  travel  to  the  closest  most  intense  thermal.  In  the  learning 
process,  reward-shaping  was  created  based  on  updraft  velocity.  When  the  locations  and  intensities  of  the 
three  updrafts  were  known,  the  distance  from  the  starting  point  of  the  agent  to  the  center  of  each  thermal  was 
calculated  using  a  simple  distance  formula.  For  this  simulation,  velocity  was  given  twice  as  much  significance 
as  distance.  Therefore,  if  the  agent  traveled  to  the  thermal  of  highest  velocity,  it  would  receive  a  reward 
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twice  as  great  as  traveling  to  the  thermal  with  the  shortest  distance  from  the  agent’s  initial  state.  In  the 
code,  these  metrics  relating  the  rewards  based  on  velocity  and  distance  are  represented  as  variables  and  can 
be  altered  if  necessary.  A  single  trajectory  of  this  simulation  is  shown  in  Figure  4.  The  agent  began  learning 
at  the  grid  point  (1000,  950)  with  a  heading  angle  of  -90  degrees  and  a  bank  angle  of  0  degrees.  Through 
learning,  the  agent  appropriately  chose  to  travel  to  the  thermal  at  the  location  (181.0450,815.9635).  Figure 
5  shows  the  agent’s  time  history  for  this  specific  trajectory. 


Figure  5:  Time  History  of  the  Simulation 


VIII.  Conclusions  and  Future  Work 

Results  presented  in  the  paper  demonstrate  the  potential  for  Reinforcement  Learning,  specifically  Q- 
learning,  to  guide  an  Unmanned  Air  System  to  a  specified  Region  of  Interest  from  a  specified  starting  state. 
There  are  several  conclusions  that  can  be  drawn  from  these  results: 

1.  The  Q-learning  technique  is  effective  at  navigating  a  UAV  to  fixed  thermal  locations.  In  the  first 
simulation,  the  agent  reached  the  goal  about  85  percent  of  the  time;  in  the  second  simulation,  the 
agent  reached  the  goal  approximately  99  percent  of  the  time. 

2.  As  more  learning  episodes  are  completed,  the  navigation  ability  of  the  UAV  improves.  In  the  first 
simulation,  for  example,  100,000  episodes  were  completed  and  a  Monte  Carlo  test  showed  that  the 
agent  reached  the  goal  81  percent  of  the  time.  After  50,000  more  episodes  were  completed  based  on 
learning  previously  obtained,  the  first  simulation  generated  success  84  percent  of  the  time.  This  relation 
shows  that  the  success  for  the  first  simulation  increased  about  3  percent  after  completing  50,000  more 
learning  episodes. 

3.  Adding  additional  thermals  increases  the  computational  complexity  linearly.  When  comparing  the  two 
simulations,  the  second  simulation  with  three  goals  was  approximately  three  times  more  computation¬ 
ally  expensive  than  the  first  simulation. 

4.  Adding  additional  thermals  to  the  navigation  area  increases  the  probability  that  the  UAV  will  find  a 
thermal  updraft. 

Future  work  will  extend  the  approach  to  the  autonomous  soaring  problem.  This  will  be  done  by  making 
the  learning  more  precise  by  refining  the  grid  spacing  using  the  Adaptive  Action  Grid  Technique.  Ad¬ 
ditionally,  an  agent  will  learn  how  to  maneuver  in  a  three  dimensional  space.  This  will  involve  changes  in 
altitude  and  will  require  an  update  and  expansion  of  the  state  and  action  matrices.  Eventually,  this  machine 
learning  process  will  be  efficient  and  applicable  to  the  autonomous  soaring  of  a  real  sailplane. 
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This  paper  develops  a  general  control  algorithm  for  precision  output  tracking  of  nonlin¬ 
ear  non-minimum  phase  dynamics  of  an  autonomous  three  degree-of-freedom  unmanned 
helicopter.  Previous  approaches  in  flight  control  literature  have  shown  approximate  track¬ 
ing  by  neglecting  the  coupling  between  the  forces  and  moments  generated  by  the  control 
effectors.  However,  it  is  shown  that  this  coupling  is  significant  in  the  model  under  study 
and  cannot  be  neglected.  In  this  paper  the  coupling  is  retained  and  natural  time-scale  de¬ 
composition  of  the  vehicle  model  is  employed  for  accomplishing  asymptotic  tracking.  The 
design  procedure  determines  the  desired  internal  state  trajectory  and  the  control  scheme 
to  stabilize  the  helicopter  in  hover.  Stability  is  analyzed  using  Lyapunov  methods  and 
results  show  that  the  approach  is  able  to  accomplish  perfect  tracking  while  stabilizing  the 
closed-loop  system  and  keeping  all  closed-loop  signals  bounded. 


Nomenclature 


f ,  g,  h  sufficiently  smooth  vector  fields 
s  intermediate  variables  of  the  system 

u  control  input  vector  of  the  system 
x  slow  variables  of  the  system 
z  fast  variables  of  the  system 

a\s  longitudinal  tilt  of  the  tip  path  plane  of  the  main  rotor  with  respect  to  the  shaft,  rad 
Fx  body  force  in  the  forward  direction,  N 

Fz  body  force  in  the  vertical  direction,  N 

g  acceleration  due  to  gravity,  m/sec2 

fiM  distance  between  c.g  and  main  rotor  positive  in  the  upward  direction,  m 
Iy  moment  of  inertia  about  pitch- axis,  kgm2 

Im  distance  between  the  c.g  and  main  rotor  along  forward  direction,  m 

M  pitching  moment  acting  on  the  helicopter,  Nm 

m  mass  of  the  vehicle,  kg 

q  body  pitch-rate,  rad/sec 

Qt  tail  rotor  torque,  Nm 

t  slow  time-scale,  seconds 

Tm  thrust  of  the  main  rotor,  N 

Tt  thrust  of  the  tail  rotor,  N 

u  body  forward  velocity,  m/sec 

w  body  vertical  velocity,  m/sec  (positive  down) 

X ,  Y,  Z  North-East-Down  helicopter  frame 

Xf,Zf  body  forces  positive  along  north  and  down  respectively 

Xapp ,  Zapp  approximate  body  forces  acting  along  north  and  down  respectively 

x  inertial  position,  positive  pointing  north,  m 

*  Graduate  Research  Assistant,  Vehicle  Systems  &  Controls  Laboratory,  Department  of  Aerospace  Engineering,  Student 
Member  AIAA,  anshunl@tamu.edu 

t  Professor  and  Director,  Vehicle  Systems  &;  Controls  Laboratory,  Department  of  Aerospace  Engineering,  Associate  Fellow 
AIAA,  valasek@tamu.edu,  http://vscl.tamu.edu/valasek  (Corresponding  Author) 


1  of  18 

Copyright  ©  201 2  by  Anshu  Siddarth  and  John  Valasek.  Published  by  the  American  Institute  of  Aeronautics  and  Astronautics,  Inc.,  with  permission. 

American  Institute  of  Aeronautics  and  Astronautics 


z  inertial  position,  positive  down,  m 
Subscripts 

*  trim  quantity 

0  reference  quantity 

a  derivative  with  respect  to  angle  a\s 
d  desired  manifold 

r  reference  trajectory 

Symbols 

cq  aq ,  /?,  pi ,  Kq  ,  Kq  feedback  gains 

e,  €i  small  quantity,  singular  perturbation  parameter 

7  sum  of  longitudinal  tilt  and  desired  pitch-attitude  manifold  a\s  +  Od 

r  fast  time-scale  ^s-,  seconds 

0  body  pitch-attitude  angle,  rad 

0( )  order  symbol 

Pi  i  >  1,  Lipschitz  constants 

Superscripts 

derivative  with  respect  to  intermediate  time-scale 
dimensionless  quantity 

error  between  the  system  state  and  desired  trajectory 

I.  Introduction 

SEVERAL  important  flight  control  problems  are  characterized  by  unstable  internal  dynamics  consequently 
resulting  in  performance  limitations.  Some  common  studies  include  acceleration  control  of  tail-controlled 
missiles,2  control  of  planar  Vertical  Take-off  and  Landing  (VTOL)  aircraft3  and  Conventional  Take-off  and 
Landing  (CTOL)  aircraft.3  This  paper  examines  the  internal  dynamics  and  synthesizes  a  stabilizing  controller 
for  a  three  degree-of-freedom  longitudinal  dynamics  of  an  autonomous  helicopter. 

Hover  control  of  a  helicopter  is  one  of  the  most  challenging  non- minimum  phase  control  problems.  To 
qualitatively  analyze  this  behaviour  consider  the  helicopter  shown  in  Figure  1.  The  motion  of  the  helicopter 
is  described  in  North-East-Down  frame  shown  as  (V,  V,  Z)  in  the  figure.  Assume  that  the  helicopter  model  is 
allowed  to  pitch  only  about  the  Y  axis.  Tm  and  Tt  are  the  thrusts  generated  by  the  main  and  the  tail  rotor 
respectively  that  keep  the  vehicle  aloft.  The  angle  a\s  is  the  longitudinal  tilt  the  tip  path  plane  makes  with 
respect  to  the  shaft  of  the  main  rotor.  Side  view  of  Figure  1  shows  that  non-zero  tilt  induces  a  component  of 
the  main  rotor  thrust  along  the  horizontal  X  axis  and  consequently  the  helicopter  propels  forward.  Hence, 
in  order  to  remain  in  hover  the  main  rotor  thrust  and  the  angle  a\s  need  to  be  controlled.  However,  changing 
this  angle  has  another  consequence.  The  forward  component  of  the  thrust  that  it  creates  induces  a  clockwise 
pitching  moment  about  the  center  of  gravity  of  the  vehicle  causing  the  nose  to  drop.  In  order  to  remain 
level,  the  angle  ais  needs  to  be  corrected.  But  doing  so  alters  the  forces  acting  on  the  helicopter  and  the 
vehicle  departs  from  hover.  For  the  helicopter  under  study,  it  will  be  shown  in  Section  II  that  desired  Tm 
and  ais  required  to  maintain  hover  lead  to  unstable  oscillatory  pitching  motion. 

Previous  studies  for  hover  control  assume  that  the  dynamical  behaviour  of  a  helicopter  is  similar  to  that 
of  a  VTOL  aircraft  as  both  these  vehicles  have  direct  control  over  the  aerodynamic  lift.  Hence  several  studies 
employ  the  control  developments  proposed  for  VTOL  aircraft.  Formulation  in  [3]  assumes  that  the  force 
contribution  from  the  longitudinal  tilt  angle  a\s  is  negligible.  Such  a  simplification  removes  the  coupling 
between  the  forces  and  the  pitching  moment  and  makes  the  resultant  dynamical  model;  approximately 
input-output  linearizable.  Reference  [  ]  used  feedback  linearization  for  stabilizing  the  resulting  approximate 
model  in  order  to  guarantee  bounded  transient  errors.  More  recently  back-stepping  has  been  used  for 
control  of  small  autonomous  helicopters.  ’  5  Other  control  techniques  based  upon  the  approximate  model 
include  dynamic- inversion  and  neural- network  based  adaptation.  In  order  to  mitigate  the  limitations 
due  to  under-actuation  some  techniques  take  advantage  of  the  inherent  multiple  time-scale  behaviour  of 
helicopters.  Reference  [11]  compared  linear  and  nonlinear  control  designs  for  the  approximate  model  using 
the  fast  rotational  dynamics  as  virtual  control  variables.  A  similar  approach  was  proposed  in  [  .2]  wherein 
Lyapunov  based  methods  were  used  to  guarantee  stability  of  a  radio/control  helicopter  model  using  the 
approximate  dynamics. 
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As  a  consequence  of  neglecting  the  coupling  between  the  forces  and  the  moments,  application  of  afore¬ 
mentioned  methods  is  limited  in  operating  regime  and  to  reference  commands  that  do  not  require  to  be 
precisely  followed.  Exact  output  tracking  was  demonstrated  by  retaining  the  coupling  terms  in  [  .3]  through 
stable- inversion  of  a  linear  helicopter  model.  This  inversion  computed  the  desired  input-state  trajectory  that 
along  with  feedforward  and  feedback  control  led  to  asymptotic  output  tracking.  Approach  in  [13]  emphasized 
that  internal- state  feedback  is  necessary  to  stabilize  a  non- minimum  phase  system.  However,  the  method 
required  an  infinite  time  preview  and  knowledge  of  the  complete  output  trajectory  beforehand. 

From  the  above  discussion  it  is  understood  that  helicopter  control  design  poses  three  major  challenges. 
First,  the  coupling  between  forces  and  moments  generated  due  to  rotor  is  significant  and  must  not  be  ignored 
during  control  design.  But  retaining  this  coupling  makes  the  system  non- minimum  phase  and  difficult 
to  stabilize.  Second,  a  non- minimum  phase  system  cannot  be  asymptotically  stabilized  in  real-time  with 
available  control  techniques  and  control  design  requires  substantial  offline  processing.  Third,  current  real¬ 
time  implementable  approaches  that  are  independent  of  the  reference  trajectory  are  limited  in  performance 
and  operating  regime. 

This  paper  presents  a  control  design  procedure  that  addresses  the  above  technical  challenges  and  validates 
the  general  nonlinear  control  procedure  developed  by  the  authors  in  [15],  [16]  for  a  three-dimensional 
longitudinal  model  of  an  autonomous  helicopter.  The  paper  makes  three  major  contributions.  First,  the 
control  design  takes  advantage  of  the  natural  time-scale  separation  and  unlike  the  techniques  discussed  in 
[  .1],  [  2]  the  coupling  between  the  forces  and  the  moments  of  the  helicopter  model  is  retained.  It  is  shown 
that  this  coupling  allows  design  of  a  sequential  procedure  for  computing  the  desired  internal  states  that  ensure 
asymptotic  output  tracking.  Second,  the  full-state  feedback  controller  designed  is  real-time  implementable 
and  is  independent  of  any  particular  operating  condition  and  desired  output  trajectory.  Third,  the  controller 
designed  is  causal  and  does  not  require  any  knowledge  or  preview  of  the  output  trajectory  beforehand. 

The  paper  is  organized  as  follows.  Section  II  describes  the  helicopter  model  under  study  and  examines 
analytically  the  non-minimum  phase  properties  of  the  vehicle.  The  nonlinear  control  design  and  stability 
of  the  closed-loop  system  is  analyzed  in  Section  III.  Simulation  validation  for  hover  control  is  discussed  in 
Section  IV.  Finally,  conclusions  are  presented  in  Section  V. 


Z 


Figure  1.  Coordinate  frame  and  forces  acting  on  a  helicopter 
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II.  Model  Description  &  Open-Loop  Analysis 


In  this  section  the  governing  equations  of  the  helicopter  model  are  presented.  Then,  the  exact  input- 
output  linearization  of  the  model  is  carried  out  and  it  is  shown  that  the  system  has  oscillatory  internal 
dynamics.  The  effect  of  neglecting  the  coupling  between  the  forces  and  moments  is  also  discussed.  Finally, 
a  time-scale  analysis  of  the  model  under  study  is  carried  out  and  essential  concepts  of  singular  perturbation 
theory  recalled. 


A.  Vehicle  Description 

The  helicopter  model  is  written  with  respect  to  earth- fixed  inertial  coordinates.  The  forces  and  the  moments 
act  in  the  body  frame  (see  Figure  1).  The  origin  of  the  body  fixed  frame  is  the  center  of  gravity  of  the  platform 
and  it  is  assumed  that  this  moves  with  the  motion  of  the  fuselage.  Reference  is  made  to  the  nomenclature 
for  the  meaning  of  the  symbols.  The  three  degree-of-freedom  equations  of  motion  of  a  symmetric  helicopter 
model  in  hover  (assuming  the  lateral/directional  components  are  in  equilibrium)  are  as  follows 


X 

cos  # 

sin# 

u 

z 

—  sin# 

cos# 

w 

mil 

-qw  +  Fx 

+ 

cos  #  —  sin  # 

0 

mw 

qu  +  Fz 

sin  #  cos  # 

_  rng  _ 

6  =  q 

IyQ  =  M 


(la) 

(lb) 

(lc) 

(ld) 


From  a  rigorous  standpoint,  the  above  set  should  be  augmented  with  dynamic  equations  of  longitudinal 
flapping.  However,  it  is  assumed  that  the  time-constant  for  the  flapping  of  conventional  rotor  blades  corre¬ 
sponds  to  one-quarter  of  a  rotor  revolution [1 7]  [pp  558-559].  This  justifies  the  use  of  rigid-body  equations  for 
describing  the  motion. 


Table  1.  Helicopter  Model  Parameters 


Parameter 

Value 

m 

4.9  kg 

ly 

0.27125 6kgm2 

hM 

0.2943 m 

Im 

—0.015m 

Qt 

O.OllOVm 

Ma 

25.23 Nm/rad 

The  body  forces  (FXl  Fz)  and  pitching  moment  M  are  generated  by  the  main  rotor  and  controlled  by 
Tm,  main  rotor  thrust  and  a\s  longitudinal  tilt  of  the  tip  path  plane  of  the  main  rotor  with  respect  to  the 
shaft.  The  aerodynamic  model  given  below  is  taken  from  [  ]. 


Fx  —  Tm  sin  cl\s 

(2a) 

Fz  —  Tm  cos  cl\s 

(2b) 

M  —  Maais  —  FxHm  +  FJm  —  Qt 

(2c) 

with  the  system  parameters  given  in  Table  1. 

B.  Exact  &;  Approximate  Input-Output  Linearization 

The  non-minimum  phase  properties  of  the  model  under  consideration  are  analyzed  by  studying  the  input- 
output  relationship.  The  desired  outputs  for  the  control  design  are  the  inertial  coordinates  of  the  vehicle, 
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namely  (x,  z)  pointing  north  and  down  respectively.  Control  inputs  available  are  the  main  rotor  thrust  Tm 
and  longitudinal  tilt  ais.  Taking  second  derivative  of  each  output, 


X 

1 

cos# 

sin# 

’  Fx 

z 

m 

—  sin# 

cos# 

Fz 

(3) 


it  is  found  that  the  relative  degree  of  each  output  is  two.  This  implies  that  the  rotational  dynamics  given  in 
(lc),(ld)  constitute  the  internal  dynamics  of  the  system. 

In  order  to  analyze  the  internal  stability  of  the  system,  the  zero  dynamics  of  the  system  needs  to  be 
examined.  Toward  this  end,  the  control  vector  ( Tm,clis )  that  constraints  the  outputs  and  its  derivatives 
on  the  origin  is  computed.  From  (3)  and  the  aerodynamic  relations  given  in  (2)  the  following  solution  is 
determined. 


Tm 

mg 

CLl  s 

-e 

Using  the  moment  relation  given  in  (2c)  and  the  constrained  control  solution  (4)  the  zero  dynamics  are 
characterized  by  the  following  equations 


0  =  q  (5a) 

Q  =  j-  [— Ma#  —  mg{hM  sin  0  —  Im  cos  6)  —  Qt]  (5b) 

The  stability  of  the  above  system  is  analyzed  by  linearizing  about  the  trim  values  6*  =  0.018rad  and 
g*  =  Or  ad/ sec. 


Ad 

.  . 

j-(—Ma  —  mg  Km  cos  6*  +  rriglM  sin#*)  0 


AO 

Aq 


(6) 


The  linearized  eigenvalues  are  d=  12. 0439 j  and  no  conclusions  about  the  stability  of  the  system  can  be  drawn. 
Rewrite  the  internal  dynamics  (5a)  and  (5b)  as 


0  =  j-  (— - Ma0  —  mg(IiM  sin  0  —  lM  cos  6)  —  QT)  (7) 

to  notice  that  the  pitch- attitude  dynamics  does  not  contain  any  damping  terms.  In  order  to  analyze  its 
stability  consider  the  quadratic  positive-definite  Lyapunov  function  Vq  —  \  ^02  +  ^q2.  The  rate  of  change 
of  the  Lyapunov  function  along  the  trajectories  of  (5)  is 


Ve 


^00  +  qq 

1y 

mg  .  n  ,  mg,  a 

—  —j—hMq  sint/  +  — —  /m^cos# 

*y  *y 

^ 

L  ly  1y 


-j-QT(l 

1y 


(8a) 

(8b) 

(8c) 


Note  the  function  h{0)  =  Km  sin 0  —  Im  cos#  is  monotonically  increasing  on  the  set  #  G  [— 7r/2,  7t / 2].  This  ob¬ 
servation  along  with  the  parameters  given  in  Table  1  conclude  that  Vq  <  0  on  the  set  {#  G  [—0.0509, 7r/2]  f]  q  G 
[0,  oo)}  (J{#  E  [~pi/ 2,  —0.0509]  f]q  G  (— oo,0]}.  On  this  set  (#*,g*)  is  the  only  equilibrium  point  and  hence 
from  the  Poincare-Bendixson  criterion  it  is  concluded  that  a  family  of  periodic  orbits  exist.  This  conclusion 
is  confirmed  in  simulation  and  the  results  are  presented  in  Figure  2.  In  fact  the  conclusions  drawn  from  the 
Poincare-Bendixson  criterion  are  conservative  since  the  simulation  shows  that  a  continuum  of  periodic  orbits 
exist  for  the  complete  state-space.  Thus  the  control  inputs  that  stabilize  the  inertial  position  of  the  helicopter 
excite  the  periodic  behaviour  in  pitch  and  exact  input-output  linearization  is  not  a  desirable  control  solution 
for  the  longitudinal  model  under  study. 

Notice  the  non-minimum  phase  behaviour  is  due  to  the  nonlinear  coupling  between  forces  and  pitching 
moment  denoted  by  h(0)  in  (8).  This  coupling  comes  through  longitudinal  tilt  solution  determined  in  (4) 
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Figure  2.  Simulation  illustrating  the  oscillatory  response  of  the  pitching  motion 


that  produces  the  required  translational  forces.  This  dependence  is  explicitly  seen  by  expanding  the  force 
terms  on  the  right-hand  side  of  (3). 


Xf  =  —Tm  sin(#  +  als) 

Z f  =  —Tm  cos (6  +  CLis)  +  Trig  (9) 

In  the  above  equations  Xf  and  Zf  represent  the  forces  in  the  inertial  plane  acting  along  the  north  and 
down  directions  respectively.  Approximate  input-output  linearization  of  the  output  dynamics  is  possible  by 
neglecting  the  dependence  of  the  longitudinal  tilt  on  the  forces.  The  approximate  forces  thus  obtained  are 

Xapp  =  Tm  sin# 

Zapp  =  ~Tm  cos  0  +  mg.  (10) 

The  exact  and  approximate  forces  acting  on  the  helicopter  under  study  is  shown  in  Figure  3  for  hover 
simulated  in  Section  IV.  Initially  the  helicopter  is  flying  at  an  arbitrary  flight  condition  and  the  forces  are 
non-zero.  Notice  after  two  seconds  the  vehicle  enters  steady  state  and  the  exact  horizontal  and  vertical  forces 
become  identically  zero.  However,  the  approximate  horizontal  force  remains  non-zero.  The  error  between 
the  exact  and  the  approximate  forces  is  shown  in  Figure  4.  The  error  is  over  100%  in  the  horizontal  forces 
while  negligible  in  the  vertical  forces.  This  conclusion  is  consistent  with  the  fact  that  rotor  blade  tilt  induces 
a  horizontal  component  of  force  in  the  helicopter  and  is  not  negligible.  As  mentioned  in  the  introduction 
some  studies  use  the  approximate  form  given  in  (10)  for  control  design.  However,  this  large  error  limits 
these  methods  to  guarantee  only  local  bounded  tracking.  In  this  paper,  the  coupling  terms  are  retained  and 
asymptotic  tracking  is  guaranteed. 

C.  Time-Scale  Analysis  of  the  Helicopter  Model 

In  this  section,  an  important  observation  regarding  inherent  time-scale  characteristics  of  the  model  under 
consideration  is  made.  This  is  done  by  studying  the  rate  of  change  of  the  non-dimensional  system  equations. 
Toward  this  end,  define  a  set  of  reference  parameters  (to ,  xo ,  zo ,  uo  =  wq  =  Vo,  #o5  Qo,  mo,  Fx o,  Fz o,  Mo,  go,  Iy o) 
and  denote  the  respective  dimension-less  quantities  as 

t  =  t/to  x  =  x/xq  z  =  z/zq  u  =  u/uq 

w  =  w/w0  0  =  0/6o  q  =  q/qo  (11) 
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(a)  Horizontal  force  (b)  Vertical  force 

Figure  3.  Exact  and  approximate  forces  acting  on  the  helicopter  model  in  hover 


Figure  4.  Error  between  exact  and  approximate  force 
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The  original  dimensional  equations  given  in  (1)  are  transformed  into  non-dimensional  form  using  definitions 
given  in  (11). 
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Without  loss  of  generality  assign 
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=  e  where  e  <<  1.  This  leads  to 
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m 


where  Fx o  =  Fz o  =  mogo  has  been  used.  Notice  that  for  any  reasonable  value  of  the  mass  of  the  vehicle 

is  an  0(l/ei)  quantity  as  the  ratio  of  pitch-angle  and 
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mass  of  the  vehicle  is  very  small  and  e\  >  e.  Finally,  assuming  that  the  vehicle  is  in  hover 
non-dimensional  form  is  obtained 
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Notice  the  above  equations  indicate  that  the  rotational  dynamics  evolves  faster  than  the  translational  coun¬ 
terpart.  Such  class  of  dynamical  systems  are  called  singularly  perturbed  and  their  analysis  is  carried  out 
using  singular  perturbation  theory.  The  above  equations  can  be  cast  in  the  following  compact  form 


x  =  f(x,  s,  z,  u,  ei,  e)  (15a) 

61S  =  h(x,  s,  z,  u,  ei,  e)  (15b) 

ez  =  g(x,s,z,u,ei,e)  (15c) 

where  x  =  [x,z,u,  w]T  are  the  slow  variables,  s  =  [6}T  is  the  intermediate  variable,  z  =  [q]T  is  the  fast 


variable  and  u  =  [Tm,&is]T  is  the  control  input  to  the  system.  The  singular  perturbation  parameters  e 
and  ei  characterize  the  different  time  scales  in  the  system  and  satisfy  0<e<ei<<l.  For  presenting  the 
concepts  of  singular  perturbation  theory,  consider  the  two-time  scale  counterpart  of  (15) 

x  =  f(x,  z,  u,  e)  (16a) 

ez  =  g(x,  z,  u,  e)  (16b) 


The  system  considered  in  (16)  is  labeled  the  Slow  System  and  the  independent  variable  t  is  called  the 
slow  time-scale.  This  system  is  equivalently  written  as  the  Fast  System 

x'  =  ef(x,  z,  u,  e)  (17a) 

z'  =  g(x,  z,  u,  e)  (17b) 

where  '  represents  derivative  with  respect  to  fast  time-scale ,  r  =  and  to  is  the  initial  time.  Note  that  in 
the  slow  system  the  slow  states  evolve  at  an  ordinary  rate  whereas  the  fast  states  move  at  a  rate  of  O  Q) .  In 
the  fast  system  the  fast  states  evolve  at  an  ordinary  rate  and  the  slow  variables  move  slowly  at  a  rate  of  O(e). 
Geometric  singular  perturbation  theory1  examines  the  behaviour  of  these  singularly  perturbed  systems  by 
studying  the  geometric  constructs  of  the  reduced-order  models  which  are  obtained  by  substituting  e  =  0  in 
(16)  and  (17).  This  results  in  reduced  slow  subsystem 


and  reduced  fast  subsystem 


x  =  f(x,  z,  u,  0)  (18a) 

0  =  g(x,  z,  u,  0)  (18b) 

x7  =  0  (19a) 

z'  =  g(x,  z,  u,  0)  (19b) 


The  dynamics  of  the  resulting  reduced  slow  subsystem  are  constrained  to  lie  upon  an  six  dimensional  smooth 
manifold  defined  by  the  set  of  points  (x,  u)  e  R4  x  R2  that  satisfy  the  algebraic  equations  (18b): 

Mo  :  zd  =  zd(x,  u)  (20) 

This  set  of  points  is  identically  the  fixed  points  of  the  reduced  fast  subsystem  (19b).  Furthermore,  the  flow 
on  this  manifold  is  described  by  the  differential  equations 

x  =  f(x,zd(x,u),u)  (21) 

if  the  reduced  fast  subsystem  is  stable  about  the  manifold  Mo-  If  the  dynamics  of  (21)  are  locally  asymp¬ 
totically  stable  about  the  manifold,  then  it  can  be  concluded  that  the  complete  system  (16)  is  also  locally 
asymptotically  stable. 


III.  Control  Formulation  and  Stability  Analysis 

Singular  perturbation  theory  concludes  that  the  stability  properties  of  the  vehicle  depend  upon  the 
identification  of  the  manifold  Mo  for  the  internal  states  (#,  q).  In  general  time-scale  control  approaches  [21]  [pp 
315-320]  solve  the  nonlinear  set  of  algebraic  equations  (18b)  for  the  manifold  first  and  then  design  a  stabilizing 
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controller  for  the  reduced  system  given  in  (21).  Note  however  that  for  the  helicopter  model  the  moment 
equation  (Id)  is  nonlinear  and  may  possess  multiple  roots  for  the  pitch- attitude  angle.  For  a  problem  wherein 
the  operating  region  is  known  apriori  then  one  of  these  roots  may  be  chosen.  But  this  process  soon  becomes 
cumbersome  and  requires  substantial  vehicle  knowledge.  Additionally,  this  procedure  restricts  the  results  to 
local  operating  regimes.  The  authors  have  studied  this  problem  for  general  singularly  perturbed  systems  in 
[15,22].  In  this  paper  an  alternate  approach  is  proposed  that  ensures  uniqueness  and  the  global  nature  of  the 
results  by  considering  the  fast  states  as  additional  control  variables.  This  allows  computation  of  an  unique 
reference  for  the  internal  states  and  maintains  complete  system  stability.  These  ideas  are  mathematically 
formulated  and  analyzed  in  this  section. 


A.  Control  Synthesis 


Using  the  procedure  described  in  Section  II. C,  the  reduced  slow  subsystem  for  (1)  is  obtained  as 


X 

cos  Od  sin 

od  ’ 

z 

—  sin  Od  cos  Od 

mu 

Qd^  "b  -Cc 

+ 

mw 

qdu  +  Fz 

-  sin  0d 
cos  Od 


0 

_  rng  _ 

(22a) 

(22b) 


where  0d  and  qd  are  manifolds  to  be  determined.  Take  additional  derivatives  of  the  position  coordinates  to 
rewrite  (22)  as 


cos  0d 
-  sin  0d 


sin  0d 
cos  Od 


(23) 


Equation  (23)  shows  that  the  pitch-attitude  angle  along  with  the  control  variables  effect  the  position  dynam¬ 
ics.  Thus,  employ  the  pitch-attitude  angle  and  the  main  rotor  thrust  Tm  to  accomplish  the  control  objective. 
Toward  this  end,  rewrite  (23)  as 


mx=  -Tm  sin(ais  (0d,  qd)  +  0d)  (24a) 

mz  =  -Tm  cos(als(Od,  qd)  +  9d)  +  mg  (24b) 

Note  in  forming  the  reduced  slow  subsystem  the  fast  variables  have  been  assumed  to  be  on  the  desired 
manifolds  ( 0dlqd )•  Hence,  the  longitudinal  tilt  used  in  the  design  of  slow  control  variables  is  a  function  of 
these  desired  manifolds.  Further,  define  the  tracking  errors  x  :=  x  —  xr  and  z  :=  z  —  zr.  Let  the  desired 
dynamics  be  specified  as 


(25a) 
(25b) 

Combining  (24)  and  (25),  the  following  relations  are  obtained 


mx  =  m(xr  —  ax  —  fix) 
mz  =  m(zr  —  oi\z  —  /3iz) 


Tm  =  m\/  (xr  —  ax  —  fix)2  +  (zr  —  a\z  —  /3iz  —  g )2 
(xr  —  ax  —  fix) 


=  arctan  - 


(zr  -  aiz  -  Pi z  -  g) 


au{0d,qd ) 


(26) 

(27) 


Remark  1.  The  choice  of  using  main  rotor  thrust,  Tm  over  the  longitudinal  tilt  for  stabilization  of  the 
reduced  slow  subsystem  was  made  considering  their  actuation  time  constants.  It  is  well  understood  that 
thrust  generation  takes  longer  than  rotation  of  an  actuator  surface  or  in  this  case  the  rotor  blade.  While 
previous  work  of  authors  in  References  [22],  [23]  assumed  infinitely  fast  actuators,  this  paper  helps  in  assigning 
control  tasks  according  to  actuator  bandwidth. 

Equations  (26)  and  (27)  complete  the  design  for  the  slow  variables  of  the  system.  Notice  however  that 
the  manifold  qd  is  unknown  at  this  point.  Toward  this  end,  formulate  the  intermediate  subsystem  as 


10  of  18 


American  Institute  of  Aeronautics  and  Astronautics 


Reduced  Intermediate  Subsystem 


x  =  0 

0  =  qd 
M  =  0 


(28a) 

(28b) 

(28c) 


where  "  is  derivative  with  respect  to  .  The  manifold  q( d  must  be  designed  to  ensure  the  pitch-attitude 
follows  Od .  This  can  be  satisfied  by  the  following  relation  obtained  using  dynamic  inversion 

qd  =  -Ke(0-0d )  (29) 

where  Kq  is  the  feedback  gain. 

The  desired  manifolds  given  in  (27)  and  (29)  depend  on  the  longitudinal  tilt  a\s  which  is  unknown.  From 
the  discussion  detailed  in  Section  II. C,  it  is  known  (29)  is  a  fixed  point  of  the  Reduced  Fast  Subsystem 


x'  =  0 
0'  =  0 


Thus,  it  is  required  that  the  following  relation  holds  for  all  time 

M  =  -IvKq(q  -  qd) 

where  Kq  is  the  feedback  gain.  Rearrange  (31)  using  the  definitions  in  (2c), (27)  and(29)  to  get 
TMhM  sin(ais)  -  TmIm  cos(ais)  +  Maais  =  Qt  ~  IyKq(q  -  qd) 

The  nonlinear  equation  in  (32)  is  solved  for  the  control  ais  using  the  small-angle  assumption 


Qt  +  TmIm 

r  IyKq  1 

TMhM  +  Ma_ 

TMhM  +  TI a  _ 

(30a) 

(30b) 

(30c) 


(31) 

(32) 


(33) 


where  q  :=  q  —  qd-  For  completeness  substitute  (33)  back  in  (27)  and  (29)  to  compute  the  desired  internal 
states 


=  arctan  - 


(xr  —  ax  —  /3x) 
(zr  -  az -  z  -  g ) 


Qt  +  TmIm 
TMhM  +  Ma 


_  _  ^  (xr  —  ax  —  Bx) 

qd  =  - Ke0  +  Ke  arctan  - 7 - - - -  -  Ke 

(zr  -  az  -  Pxz  -  g) 


Qt  +  TmIm 
TMhM  +  Ma 


This  completes  the  control  design  procedure. 


(34a) 

(34b) 


B.  Stability  Analysis 

The  following  theorem  summarizes  the  main  result  of  the  paper. 

Theorem  1.  Suppose  the  controls  Tm  and  a\s  of  the  system  (1)  are  designed  according  to  the  feedback 
relations  given  in  (26)  and  (33).  Then  for  initial  conditions  in  the  operating  region  \6\  <  15 deg,  |ais|  <  25 deg 
and  0  <  Tm  <  69.48  the  control  uniformly  asymptotically  stabilizes  the  non-minimum  phase  helicopter  model 
(1)  and  equivalently  drives  the  states  x(t)  xr(t)  and  z(t )  — >>  zr(t)  keeping  all  other  states  and  control 
inputs  bounded. 
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Proof.  The  closed-loop  system  system  in  error  coordinates  is  given  as 


x  =  X\ 

xi  =  —Fx  cos (0  +  6d)  +  —Fz  sin(<9  +  0d)  -  xlr 
mn  rn 

Z  —  Z\ 

z\  =  ~—Fx  sin(<9  +  0d)  +  —  Fz  cos (0  +  0d)  +  g  -  zlr  (35) 

mm 

0  =  qd  +  q-  0d 
,  Md  +  (M-  Md)  . 

9  = - t - Qd 

1y 

where  6  :=  0  —  6d,  q  :=  q  —  qd  and  Md  =  Maais  +  —  Tm^m  —  Qt  is  the  moment  obtained  after 

making  the  small-angle  approximation  in  arriving  at  (33).  The  closed-loop  system  is  equivalently  written  as 

x  =  X\ 

xi  =  —Fx  cos6d  +  —  FzsmOd  -  xlr 
m  m 

H - Fx  cos (0  +  0d)  -  cos  0d  H - Fz 

ml  J  m 


sin (0  +  0d)  -  cos#d 


z  =  Z\ 

h  =  -~Fx  sin 0d+  —Fz  cos  0d  +  g  -  zlr 
m  m 


(36) 


m 


sin (0  +  6d)  -  sin0d 


-F  Z 

m 


cos  (0  +  0^)  —  cosOd 


0  =  qd  +  q  ~  0d 

z  Md  +  (M-  Md)  . 

q  = - j - Qd 

1y 

Using  the  relations  in  (25),  (29)  and  (31)  rearrange  (36)  to  get 
x  =  X\ 

xi  =  -olx!  -  /3x  +  —  cos 0d[Fx  -  Fx(als(6d,qd))}  +  —  sin 0d[Fz  -  Fz(als(Od,qd))\ 
m  m 


~FX 

m 


cos  (0  +  0d)  —  cos  6d 


-T  Z 

m 


1  (0  +  0d)  -  cos  9d 


Z  =  Z\ 

Zi  =  -OL\Z\  -  Pi Z-  —  sin ed[Fx  -  Fx(au(9d,qd))]  +  —  cos  9d[Fz  -  Fz(als(9d,qd))}  (37) 

m  m 

-  —Fx  [sin(0  +  6d)  -  sin  0^1  +  —Fz 
ml  J  m 


cos  (0  +  6d)  —  cos  6d 


0  =  —KqO  +  q  —  0d 

z  ^  „  ,  M-Md  . 
q  =  -i^g  H - - - gd 

1y 

Closed-loop  system  stability  of  the  system  states  is  analyzed  using  the  Lyapunov  function  approach. 
Consider  a  positive-definite  and  decrescent  Lyapunov  function  candidate 


V(x,xi,z,  zi,0,  q)  =  ^  f3x 2  +x\  +  P\z2  +  zf  +  02  +  q2 


(38) 
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for  the  complete  closed- loop  system.  The  derivative  of  V  along  the  trajectories  of  (37)  is  given  by 

V  =  -  ax\  +  —  cos ed[Fx  -  Fx(als(Od,qd))]xi  +  —  sin  0d[Fz  -  Fz(als(6d,qd))]xi 
m  rn 


H - Fx 

m 


cos  (0  +  Od)  —  cos  t 


%i  H - Fz 

rn 


sin(#  +  Od)  —  cos  t 


Xi 


(39) 


sin Od[Fx  Fx (nis (Od-,  Qd))\zi  T  cos^f-F^  Fz{cL\s{0d-,  ^d))]^i 
m  m 


-  -e 

m 


zi~\ - F- 

rn 

'2  ,  nn  JS  -2  I  M  —  Md 


sin(6>  +  6>d)  -  sin6>rf 

-  k00*  +  Oq  -  eed  -  Kqq2  + 


COS (0  +  Od)  —  COS  t 
Q-QQd 


Zi 


Using  the  Lipschitz  behaviour  of  the  vector  fields  on  the  domain  defined  in  Theorem  1  the  following  conditions 
hold 


|  sin(0  +  Od)  —  sin 0d\  <  O.35|0| 
\Fx-Fx(ala(0d,qd))\  <  \Tm\ 

|  cos (6  +  Od)  —  cos  Od  |  ~  0 

1^  -  ^(«is((9d,^))|  «  0 


Tm^m  +  FIa 


\q\ 


(40) 

(41) 

(42) 

(43) 


Note  conditions  given  in  (42)  and  (43)  give  bounds  on  the  magnitude  of  the  error  between  the  exact  and 
approximate  vertical  force.  This  bound  remains  close  to  zero  for  large  changes  in  0  and  this  condition  was 
numerically  verified  for  the  model  under  study  in  Section  II.  Resulting  derivative  of  the  Lyapunov  function 
given  in  (39)  using  conditions  (40)  through  (43)  becomes 


V  <  —  ax\  -f - \Tm\ 


IyKq 


Tm^m  +  FI a 


•  ol\z\  +  0.35|Tm| 


IyKq 


Tm^M  +  FI a 


\%i\\q\  +  0.35  —  |Tm||^i||0| 


\zi\\q\  +0.35-|TM||als||51||^| 


(44) 


-Ke0z  +  0q-00d-Kqqz  + 


2  ,  M-Md 


q  -  qqd  • 


The  time  derivative  of  the  manifolds  Od  and  qd  is  determined  next.  Toward  this  end,  rearrange  (27)  as 

Xdes(t) 


tan  7  = 


Zdes{t) 


(45) 


where  7  =  Od  +  &i s(0d,qd),  Xdes  =  ocr  —  ax  1  —  f3x  and  Zdes  =  zr  —  a\Z\  —  /3z  —  g  have  been  defined  for 
convenience.  Differentiate  (45)  to  get 


•  _  COS7  •  sin 7  • 

T  —  m  /  -^deS  rn  /  ^  de 


Tm/itt 


Tm /m 


(46) 


using  the  fact  Tm /m  =  y/ (X^es  +  Zjes)  and  definition  of  the  angle  7.  The  time  rate  of  change  of  the 
longitudinal  tilt  ais(0d,qd)  is  determined  by  differentiating  (33)  along  the  inertial  position  trajectories. 


d 


ais  ~  dt 


TmIm  +  Qt 


Tm^m  +  FI a 
lMMa  —  HmQt 


L  M 


(47) 


_(rMhM  +  FIa)2  _ 

where  Tm  =  msm'yXdes+mcos'yZdes-  Combine  (46)  and  (47),  to  determine  the  derivative  of  the  manifolds 


Od  —  m 
Od 


cos  7 
Tm 
Ke~0 


—  clt  sm  7 


X-des  T  m 


—  SUI7 


1  M 


—  clt  cos  7 


Zdes 


(48) 
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where  clt  =  and  the  vari°us  derivatives  are  a  function  of  closed-loop  system  dynamics.  Using 

properties  (40)  through  (43)  and  (37) 


\Xdes\  <  <xfi\x\  +  (<a2  -  P)\xi\  H - \Tm\ 

mn 

0  3^ 

\Zdes\  <  Otxfrlzl  +  (af  —  /3l)\zi\  + 


dlyKq 


+  Mc 

OL\IyK q 


\q\  +  0.35  — |Tm||0| 
m 


(49) 


m 


tM 


Tm^m  +  Ma 


|<z|  +  0.35— |TM||0|. 
m 


Combine  (48),  (49)  and  (44)  to  get 


1 


V  <  —  ax1  H - 1  7m  | 

m 

—  oq2^  -(-  0.35|Xj 


4*4 


Tm^m  +  Ma 


\xi\\q\  +  0.35  —  |Tm||#i||0| 


m 


M| 


4** 


Tm^m  +  Ma 


+  0.35  —  |7m||<Us||^i||0| 


m 


(50) 


-  Kg62  +  \e\\q\  +  (\e\  +  Ke\q\)\6d\  -  Kqq 2  +  (ff*  -  tf£)|%| 


By  definition  clt  is  a  small  quantity  and  \cosy\  =  |  siny|  <  1,  define  k  =  which  is  again  a  small  quantity. 
Substitute  for  time  rate  of  change  of  the  manifold  0d  into  (50)  to  get 


V  <  -  olx\  H - \Tm\ 

m 

~2 


IyKq 


Tm^m  +  Ma 


IyKq 


|x1||g|+0.35-|TMp1||^| 


—  a1S1  +  0.35|Tm|  Tm}im  +  Mi 

-  Ke62  +  \§\\q\  -  Kqq2  +  (Ke  -  K2)\0\\q\ 


m 


5i||g|  +  0.35-|TM||ais||ii||^| 
m 


(51) 


+  ^(|0|  +  Ko\q\) 


Oifi\x\  +  (a2  —  f3)\xi\  +  ai/3i\z\  +  (a2  —  f3i)\zi\ 


~\Tm\ 

m 


IyK q 


Tm^m  +  Ma 


(a  +  0.35aq)|g|  -  0.35(a  +  oq)  —  |TM| 101 

m 


Rearrange  (51)  to  get 


v  <  -r» 


(52) 


where  =  [x,  x\ ,5,  iq,  #,  g]T  and  matrix  K  is  given  below 

0  0  0  0  ii  i 

0  —a  0  0  /i3 

0  0  0  0  Ms 

0  0  0  — oq  M7 

Mi  M3  Ms  M7  -  0.35(<a  +  aq)^^ 

M2  M  4  Me  Ms  M9 


M2 

M4 

Me 

Ms 

M9 


-Kg-  +  0.35a!) 


TmIim+Mo, 


(53) 


where  /zi  =  yu2  =  Ms  =  °'325|[m|  +0.5k(q;2  —  /?),  M4  =  ^I^mI 

Me  =  M7  =  °-152TMl  +0-5K(a2-ft),  M8  =  IflT, 


_  KOtiPi 

M  5  —  2 


M9  =  0.5(^  -  K2  +  1)  - 


2m 
In,  Ka 


t  M 


TMhM+Mc 

IvKa 


Tm  hM  -\~Ma 


+  0.5/^7U6>  (q^2  —  /3), 

TMtM+M„  +0.5^(a2-/31), 
(a  +  0.35aq )  —  0.35fti^(<a  +  ai*e  constants  function 

is  negative  semi-definite  by  appropriate  choice  of  the  feedback 


of  the  feedback  gains.  Hence,  the  matrix 
gains.  Note  the  semi-definiteness  property  is  due  to  the  small  values  of  constants  Mi?  M2 5  Ms  and  M6-  Since 
V  <  0  and  V  >  0,  all  terms  in  V  G  that  is  {x,  5q,  i,  Si,  q}  G  £oo.  Furthermore,  since  the  reference 
trajectory  states  are  bounded,  all  terms  in  expressions  for  Tm  and  a\s  in  (26)  and  (33)  respectively  are 
bounded.  Hence  the  right  hand  side  of  the  closed-loop  system  in  (37)  is  bounded  and  thus  G  £oo.  Thus 
using  Barbalat’s  lemma  it  is  concluded  that  signals  of  vector  T  — *  0  as  t  — >>  oo  and  the  result  in  Theorem  1 
follows.  This  completes  the  stability  analysis.  □ 
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IV.  Simulation  Study:  Hover  Control 


The  purpose  of  this  section  is  to  illustrate  the  preceding  theoretical  developments  and  demonstrate  the 
controller  performance  for  an  autonomous  helicopter  model.  The  reference  trajectory  and  all  its  derivatives 
are  set  to  zero  to  illustrate  the  stabilizing  performance  of  the  controller  for  the  open-loop  non-minimum 
phase  system  (discussed  in  Section  II.B).  The  feedback  gains  were  chosen  to  preserve  the  time-scale  nature 
of  the  helicopter  model  a  =  aq  =  2,/3  =  /A  =  1,  Kq  =  3  and  Kq  =  10.  The  various  constants  for  matrix 
IK  are  AT  =  /is  =  0.082,  ^2  =  =  0.245,  fi 3  =  2.26,  /i4  =  0.755,  /i 7  =  1.06,  /is  =  0.5  and  /ig  =  —4.68. 

The  corresponding  eigenvalues  of  the  matrix  IK  are  Ay2  =  0.00,  A3  =  —1.65,  A4  =  —1.99,  A5  =  —8.62  and 
A6  =  —22.39  and  Theorem  1  guarantees  asymptotic  stability.  The  initial  conditions  chosen  were  x(0)  =  —2m, 
z( 0)  =  2m,  u(0)  =  w( 0)  =  0 m/sec,  0(0)  =  15 deg  and  q(0)  =  30 deg/sec. 

Figure  5  and  Figure  6  present  the  closed-loop  response  of  the  helicopter.  The  controller  demonstrated 
asymptotic  tracking  irrespective  of  the  desired  reference  trajectory  in  the  domain  (x,  z,  ii,  u/,  0,  q)  E  [—50,  50] mx 
[—15,  50]m  x  [—30,  20]m/sec  x  [—5,  20]m/sec  x  (— 7t/2,  ir/2)rad  x  [— 7r,  ir}rad/sec.  Notice  that  the  large  initial 
condition  errors  die  out  within  the  first  6seconds.  The  forward  velocity  is  increased  in  order  to  correct 
the  error  in  forward  position.  Close  output  tracking  is  a  result  of  precision  desired  manifold  following  by 
the  internal  states.  The  pitch-attitude  angle  settles  down  to  the  trim  value  of  0.018rad(1.03deg)  that  is 
automatically  computed  by  the  manifold  (34a).  The  time-scale  behaviour  of  the  system  states  is  apparent 
in  the  time  histories.  Notice  that  the  pitch-rate  starts  to  follow  the  desired  manifold  within  2seconds  fol¬ 
lowed  by  the  response  of  the  pitch- attitude  angle  closely  tracking  the  desired  manifold  within  4seconds.  The 
transient  errors  of  the  slowest  and  also  the  outputs  of  the  problem  under  study  die  out  in  6seconds.  The 
control  inputs  are  shown  in  Figure  7.  The  control  inputs  settle  down  to  the  trim  values  Tm  =  48.027V  and 
a\s  =  —  0.018rad(— 1.03deg)  once  the  system  errors  have  stabilized  about  the  origin.  The  two-dimensional 
trajectory  of  the  helicopter  is  shown  in  Figure  8.  Initially  the  helicopter  corrects  the  large  error  in  the  pitch- 
attitude  angle.  This  is  done  by  reducing  the  requirements  on  pitch-rate  and  in  turn  the  longitudinal  tilt. 
After  this  correction,  the  vehicle  starts  climbing  to  the  desired  hover  position.  From  then  on,  the  helicopter 
remains  in  hover. 


Figure  5.  Closed-Loop  Output  Response  of  the  Helicopter 


V.  Conclusions 

A  control  formulation  for  output  tracking  of  an  autonomous  nonlinear  non-minimum  phase  helicopter 
was  developed.  The  desired  internal-state  reference  and  feedback  control  to  stabilize  the  unstable  inter¬ 
nal  dynamics  were  computed  using  the  inherent  time-scales  of  the  system.  Controller  performance  was 
demonstrated  through  numerical  simulation  for  the  helicopter  in  hover. 

Based  on  the  results  presented  in  the  paper,  the  following  conclusions  are  drawn.  The  final  output  tracking 
error  for  the  positions  remained  within  |0.0010|.  This  perfect  output  tracking  was  a  result  of  perfect  internal 
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(a)  Pitch- Attitude  Angle  Time  History  (b)  Pitch-Rate  Time  History 


Figure  6.  Closed-Loop  Internal  Dynamics  of  the  Helicopter 


(a)  Main  Rotor  Thrust  Time  History 


(b)  Longitudinal  Tilt  Angle  Time  History 


Figure  7.  Control  Inputs  to  the  Helicopter 
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Figure  8.  Closed-loop  Trajectory  of  the  Helicopter 


state  tracking  that  was  achieved  by  the  nonlinear  feedback  law.  The  results  of  Theorem  1  are  restricted 
in  operating  regime  due  to  the  small  angle  approximation  made  in  (33).  Unlike  previous  approaches  this 
limitation  is  not  due  to  simplifications  made  to  the  dynamical  model  and  can  be  improved  by  use  of  non- 
afhne  control  methods.  In  fact  the  conclusions  regarding  operating  region  of  the  controller  from  Theorem  1 
are  conservative.  As  shown  in  the  simulation  section,  the  controller  demonstrates  stable  performance  for  a 
large  operating  region.  Additionally,  the  controller  is  causal  and  does  not  require  any  prior  information  or 
preview  of  the  desired  reference. 
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Abstract — For  a  class  of  linear  time-invariant  systems  with 
uncertain  parameters,  this  paper  proposes  and  develops  a  notion 
of  the  Domain  of  Control  Authority  to  achieve  stable  adaptation 
in  the  presence  of  control  position  limits.  The  Domain  of  Control 
Authority  defines  the  subspace  in  which  the  plant  state  can  be 
driven  in  any  arbitrary  direction  by  bounded  control.  No  re¬ 
strictions  are  placed  on  the  stability  of  the  open-loop  system.  To 
address  the  problem  of  containing  the  state  within  the  Domain  of 
Control  Authority,  a  switching  control  strategy  with  a  direction 
consistent  control  constraint  mechanism  is  developed  for  an 
unstable  plant.  This  restricts  the  resultant  direction  of  the  rate  of 
change  of  the  state  to  be  the  same  as  the  direction  of  the  desired 
reference  state.  Stability  proofs  are  provided,  and  controller 
performance  is  demonstrated  with  numerical  examples  of  a  two 
degree-of-freedom  dynamic  model  and  an  F-16XL  aircraft  model. 

Index  Terms — Adaptive  control,  control  saturation  constraints, 
direction  consistent  control  constraint  mechanism,  dynamic  inver¬ 
sion,  linear  systems. 


I.  Introduction 

ACTUATOR  saturation  is  a  major  consideration  for  all 
practical  control  systems,  and  many  strategies  to  over¬ 
come  its  effects  have  been  studied.  For  example,  Hu  and  Lin 
have  done  seminal  work  in  analyzing  the  controllability  and  sta¬ 
bilization  of  unstable,  linear  time-invariant  systems  with  input 
saturation  [1]— [3] .  They  explicitly  identified  the  null  control¬ 
lable  region  of  the  state-space  for  linear  systems  with  saturated 
linear  feedback.  However,  their  work  does  not  address  systems 
with  uncertain  parameters.  Traditionally,  adaptive  control 
assumes  full  control  authority  and  lacks  a  theoretical  basis  for 
control  in  the  presence  of  actuator  saturation  limits.  Saturation 
is  a  problem  for  adaptive  systems  since  continued  adaptation  in 
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the  presence  of  saturation  may  lead  to  instability.  In  the  past, 
much  effort  has  been  expended  for  adaptive  control  design 
in  the  presence  of  input  saturation  constraints  [4].  Karason 
and  Annaswamy  presented  the  concept  of  modifying  the  error 
proportional  to  the  control  deficiency  [5].  They  laid  out  a 
rigorous  mathematical  proof  of  the  boundedness  of  signals 
for  a  model  reference  framework  and  identified  a  set  of  initial 
conditions  of  the  plant  and  the  controller  for  which  a  stable 
controller  could  be  realized.  Akella,  Junkins,  and  Robinett  de¬ 
vised  a  methodology  to  impose  actuator  saturation  constraints 
on  the  adaptive  control  law  analogous  to  Pontryagin’s  principle 
for  optimal  control  in  order  to  make  the  error  energy  rate  as 
negative  as  possible  with  admissible  controls  [6].  They  iden¬ 
tified  a  boundary  layer  term,  which  is  the  difference  between 
the  calculated  control  and  the  applied  control,  and  imposed 
conditions  on  the  adaptive  update  laws  to  bound  the  boundary 
layer  thickness.  More  recently,  Johnson  and  Calise  applied 
the  concept  of  “pseudo-control  hedging”  to  adaptive  control, 
which  is  a  fixed  gain  adjustment  to  the  reference  model  that  is 
proportional  to  the  control  deficiency  [7].  Lavretsky  and  Hov¬ 
akimyan  have  proposed  a  new  design  approach  called  “positive 
(i -modification”  that  guarantees  that  the  control  never  incurs 
saturation  [8].  In  [9]  the  “£i  adaptive  controller”  is  extended 
to  include  control  constraints  for  linear  plants  with  known 
control  influence.  Hong  and  Yao  [10]  synthesized  a  robust  con¬ 
troller  specifically  for  precision  control  of  linear  motor  drive 
systems  using  backstepping,  while  addressing  the  different 
physical  uncertainties.  Kahveci  and  Ioannou  [11]  extended 
the  anti-windup  compensator  design  for  stable  systems  with 
actuator  position  and  rate  limits,  and  a  similar  problem  was 
addressed  in  Leonessa  et  at.  [12]  by  modifying  the  reference  to 
maintain  system  stability  and  control  within  bounds. 

Dynamic  Inversion  is  an  approach  which  has  been  widely 
used  in  recent  years  for  the  control  of  nonlinear  systems,  es¬ 
pecially  in  the  field  of  aerospace  engineering  [13]— [16] .  A  fun¬ 
damental  assumption  in  this  approach  is  that  the  inherent  plant 
dynamics  are  modeled  accurately,  and  therefore  can  be  canceled 
exactly  by  the  feedback  functions.  In  practice,  this  assumption 
is  not  realistic;  the  dynamic  inversion  controller  requires  some 
level  of  robustness  to  suppress  undesired  behavior  due  to  plant 
uncertainties.  To  overcome  this  robustness  problem,  an  adap¬ 
tive  model  of  the  plant  dynamics  sometimes  is  used  to  facilitate 
the  inversion,  which  is  then  updated  in  real-time  based  on  the 
response  of  the  system.  This  gives  rise  to  an  entire  class  of  con¬ 
trollers  which  may  be  referred  to  as  adaptive  dynamic  inversion 
controllers  [17]. 

This  paper  investigates  problems  introduced  in  adaptive  dy¬ 
namic  inversion  control  schemes  due  to  bounds  on  the  control 
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and  develops  a  three  component  control  scheme  to  overcome 
them.  The  contributions  of  this  paper  are  the  identification  of 
the  domain  of  attraction  considering  the  control  position  limit 
and  the  development  of  a  switching  control  strategy  to  contain 
the  plant  within  this  domain.  Another  novel  idea  is  that  of  a  di¬ 
rection  consistent  control  constraint  mechanism  in  the  presence 
of  control  saturation.  This  is  achieved  in  part  by  preserving  the 
control  input  direction.  While  the  idea  of  preserving  the  control 
input  direction  using  control  allocation  is  not  entirely  new  [18], 
it  is  restricted  to  preserving  the  direction  of  the  control  vector 
only.  This  paper  formalizes  and  extends  a  concept  by  Tandale 
and  Valasek  in  [19]  that  not  only  preserves  the  direction  of  the 
control,  but  also  attempts  to  preserve  the  direction  of  the  re¬ 
sultant  rate  of  change  of  the  state  to  be  the  same  as  that  of  the 
desired  rate.  Additionally,  a  modified  adaptation  mechanism  is 
implemented  to  prevent  incorrect  adaptation  arising  from  tra¬ 
jectory  errors  due  to  control  saturation.  Here  the  mathematical 
development  of  the  control  scheme  and  the  adaptation  mecha¬ 
nisms  is  presented,  along  with  proofs  for  the  convergence  of  the 
tracking  error  and  the  stability  of  the  overall  control  scheme. 

This  paper  is  organized  as  follows.  Section  II  describes  the 
class  of  plants  that  are  considered.  Section  III  defines  the  con¬ 
cept  of  the  Domain  of  Control  Authority  (DC A)  for  plants  with 
bounded  control.  The  switching  control  strategy  and  the  direc¬ 
tion  consistent  control  constraint  mechanism  are  explained  in 
Section  IV.  The  development  of  the  control  law  and  the  modi¬ 
fied  update  law  to  prevent  the  incorrect  update  of  parameters  due 
to  saturation  is  presented  in  Section  VI.  Section  VII  presents 
simulation  results  for  a  two-dimensional  planar  plant  and  an 
F-16XL  aircraft  model.  Finally,  conclusions  are  presented  in 
Section  VIII. 

II.  System  Dynamics 

Consider  linear  time-invariant  continuous  dynamic  systems 
of  the  form 

x  =  Ax  +  Bua  (1) 

where  x(t)  G  lRn  is  the  state  vector,  A  G  [RnXn  is  the  ma¬ 
trix  of  unknown  plant  parameters,  u a(t)  G  Un  is  the  vector  of 
applied  controls  driving  the  system,  and  B  G  RnXn  is  the  un¬ 
known  control  effectiveness  matrix.  For  this  work,  the  number 
of  controls  equals  the  number  of  states  so  that  the  control  effec¬ 
tiveness  matrix  is  square  and  non-singular  to  permit  dynamic 
inversion.  Each  control  uai  is  symmetrically  bounded  between 
the  values  [— urni ,  Aumi].  The  plant  matrices  A  and  B  are  not 
known  exactly.  The  nominal  values  of  the  plant  matrices  A  and 
B  are  specified,  with  a  percentage  uncertainty  for  each  element 
of  the  plant  matrix  given  as 

A{j  =  Aij  +  A  A{j  (2) 

Bij  =  Bij  +  A  Bij  (3) 

where  |AA^-|  <  0.05|A^|  and  \AB{j\  <  0.05|^|. 

The  reference  trajectory  is  specified  in  terms  of  xr,  which  is 
chosen  such  that  it  is  uniformly  continuous,  bounded,  and  dif¬ 
ferentiable  with  first  order  continuous,  bounded  derivatives  xr. 
The  control  objective  is  to  track  any  feasible  reference  trajec¬ 
tory  that  can  be  followed  within  the  control  limits.  For  trajec¬ 


tories  that  are  not  feasible  with  respect  to  the  control  limits,  the 
objective  is  to  track  the  reference  trajectory  as  closely  as  pos¬ 
sible,  while  maintaining  stability  and  ensuring  that  all  signals  re¬ 
main  bounded.  Further,  it  is  assumed  that  the  entire  state  vector 
is  measurable  and  that  no  observer  is  necessary  to  estimate  the 
states. 

III.  Domain  of  Control  Authority  (DCA) 

One  of  the  most  fundamental  issues  associated  with  the  con¬ 
trol  of  a  system  is  controllability.  While  unconstrained  control¬ 
lability  [20]  has  been  well  understood  for  several  decades,  the 
understanding  of  constrained  controllability  is  incomplete  [4]. 
The  following  discussion  considers  how  bounds  on  controls  af¬ 
fect  controllability.  While  a  linear  scalar  plant  is  used  to  eluci¬ 
date  the  concepts,  the  discussion  extends  to  multiple-input-mul- 
tiple-output  (MIMO)  plants  in  which  the  number  of  controls 
equals  the  number  of  states.  Consider 

x  =  —  a*x  +  b*ua  (4) 

where  x{t)  G  U1,  and  a*,  b*  are  unknown  scalars  with 
6*  /  0  and  a*  >  0  such  that  the  inherent  dynamics  are 
stable.  The  applied  control  ua  is  bounded  symmetrically  as 
ua  G  f rum,  +wm],  where  um  >  0  is  a  known  control  limit. 

There  are  two  types  of  constraints  that  may  be  imposed  on 
the  plant  state- space  because  of  the  bounds  on  the  control. 

1)  Control  Authority  Constraint:  If  the  plant  is  open-loop 
stable,  the  only  diverging  tendency  that  can  propagate  the 
system  away  from  the  equilibrium  point  is  provided  by 
the  control.  This  diverging  tendency  can  be  infinite  for  a 
system  which  is  controllable  and  has  unbounded  control.  If 
the  control  is  bounded,  there  will  be  a  boundary  in  the  state- 
space  beyond  which  the  converging  tendency  of  the  plant 
is  greater  than  the  diverging  tendency  due  to  the  bounded 
control. 

2)  Tracking  Constraint:  Consider  the  plant  model  from  (4). 
Since  the  control  is  bounded  within  [— iim,  +um\,  the  rate 
of  change  of  the  state  at  any  point  of  time  is  bounded  by 

-a*x  +  b*Um,  ifu  =  urn 

—  a*X  —  b*Um ,  if  U  =  —Um 

where  x  is  the  plant  state  at  that  instant  of  time.  Any  ref¬ 
erence  trajectory  that  the  plant  can  successfully  track  must 
satisfy  the  rate  bounds  listed  in  (5). 

The  controllability  test  for  linear  systems  ensures  that  the 
control  can  affect  every  state,  but  does  not  consider  the  effect 
of  bounds  on  the  control.  To  have  complete  authority  over  the 
plant,  the  bounded  control  must  be  able  to  overcome  the  inherent 
plant  dynamics  and  prescribe  the  desired  dynamics. 

A.  Case  1:  Stable  Plant  (a*  >  0) 

Considering  the  plant  model  of  (4),  the  inherent  plant  dy¬ 
namics  are  given  by  the  term  —a*x.  The  control  authority  is  lim¬ 
ited  by  the  bounds  on  the  control  to  values  of  ±6*um,  so  there 
exists  boundaries  in  the  plant  state-space  beyond  which  the  in¬ 
herent  plant  dynamics  will  dominate  the  control  effort,  and  the 
plant  will  not  be  controllable.  These  boundaries  will  be  reached 
when  an  extremal  control  is  necessary  to  cancel  the  inherent 
plant  dynamics.  In  the  interior  region  the  control  has  the  ability 
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<b) 


Fig.  1.  Phase  plot  showing  the  domain  for  a  traceable  trajectory  for  an  open- 
loop  stable  plant  and  for  a  neutrally  stable  plant,  (a)  Phase  plot  for  an  open-loop 
stable  plant,  (b)  Phase  plot  for  a  neutrally  stable  plant. 


to  nullify  the  inherent  plant  dynamics  without  reaching  its  ex¬ 
tremal  values.  This  interior  region  is  called  the  DC  A.  Referring 
to  Fig.  1(a),  the  boundaries  in  terms  of  the  plant  state  are 


&  Abound  —  b  Ua 

(6) 

CL  Abound  —  d=  b  v>rri 

(7) 

_  ,  b*um 

Abound  —  ^  * 

n  * 

(8) 

a 


Equation  (8)  gives  the  vertical  bound,  and  (5)  gives  the  bounds 
on  the  rate  of  change  of  the  state.  Outside  of  the  DC  A  boundary 
x  <  0  and  the  plant  moves  toward  the  origin.  If  the  initial  state 
is  within  the  DCA,  then  the  control  cannot  drive  the  state  out¬ 
side  the  DCA.  If  the  initial  state  is  outside  the  DCA,  the  in¬ 
herent  plant  dynamics  will  drive  the  state  into  the  DCA.  Thus 
the  bounded  control  does  not  lead  to  instability  for  an  open-loop 
stable  plant,  but  only  to  a  limited  operational  envelope  for  the 
plant. 

B.  Case  2:  Neutrally  Stable  Plant  (a*  =  0) 

The  derivative  of  the  state  is  affected  only  by  the  control 

i  =  b*ua.  (9) 

The  plant  state  is  not  bounded  and  can  take  any  value  on 
[— oo,  Too] ,  but  the  rate  of  change  of  the  state  is  limited  due  to 
the  control  bound.  The  rate  limits  for  a  traceable  trajectory  [see 
Fig.  1(b)]  are 

(10) 


C.  Case  3:  Unstable  Plant:  ( a *  <  0) 

For  current  state  x  the  unforced  response  —a*x  drives  the 
plant  away  from  the  state  x  =  0.  If  the  plant  reaches  a  state 
where  the  destabilizing  tendency  becomes  greater  than  the  max¬ 
imum  restoring  contribution  that  the  control  can  provide,  then 
the  state  continues  to  diverge,  such  that  x  —>  oo  if  \x\  > 
\b*urn/a"\.  These  points  determine  the  boundary  of  the  DCA. 
If  the  state  crosses  these  points,  stability  of  the  system  cannot 
be  recovered. 

IV.  Switching  Control  Strategy 

Consider  a  plant  that  is  required  to  track  an  arbitrary  reference 
trajectory  using  a  dynamic  inversion  controller.  The  bounded 
control  must  cancel  the  inherent  plant  dynamics  yet  retain  suf¬ 
ficient  control  effort  to  prescribe  a  rate  of  change  of  the  state  in 
any  arbitrary  direction  of  the  state- space.  If  an  extremal  value  of 
control  is  necessary  to  cancel  the  inherent  plant  dynamics,  then 
there  is  at  least  one  direction  in  which  the  plant  state  cannot 
be  driven.  Therefore,  the  DCA  consists  of  the  set  of  the  system 
equilibrium  states.  Depending  on  the  system  stability,  some  of 
these  plant  states  can  be  driven  in  any  arbitrary  direction  by 
a  bounded  control.  The  boundary  of  the  DCA  is  defined  by 
the  states  in  which  at  least  one  control  must  take  on  its  ex¬ 
tremal  value  in  order  to  cancel  the  inherent  plant  dynamics.  Out¬ 
side  the  DCA  open-loop  stable  plants  can  never  cross  the  DCA 
boundary  and  remain  bounded.  Unstable  plants  diverge  since 
the  inherent  diverging  tendency  dominates  the  maximum  pos¬ 
sible  converging  tendency  that  the  control  can  provide. 

The  solution  strategy  proposed  here  is  to  identify  the  DCA 
and  to  develop  a  control  law  to  prevent  the  plant  state  from 
crossing  the  DCA  boundary.  The  control  required  to  perform 
the  tracking  objective  may  be  applied  when  the  state  is  not  near 
the  DCA  boundary.  Once  the  state  nears  the  DCA  boundary, 
the  control  can  be  switched  to  a  stabilizing  control  that  cancels 
the  plant  dynamics  and  provides  a  restoring  tendency  toward 
the  origin.  It  should  be  noted  that  whenever  the  state  is  within 
the  DCA,  the  magnitude  of  the  rate  of  change  of  the  state  is 
restricted  because  of  the  bounded  control,  but  the  direction  of 
the  rate  of  change  of  the  state  is  not  limited.  The  Sections  IV- A 
and  IV-B  discuss  methods  to  identify  the  DCA  boundary  and 
the  concept  of  stabilizing  control. 

A.  Enforcing  the  Switching  Control  Law  Without  Explicit 
Identification  of  the  DCA  Boundary 

The  DCA  is  defined  by  the  states  where  at  least  one  control 
must  equal  its  extremal  value  in  order  to  cancel  the  inherent 
plant  dynamics.  At  the  boundary  of  the  DCA,  (1)  becomes 

0  —  A  Xboimdary  T  B  Uextremal*  (H) 

The  entire  DCA  boundary  can  be  evaluated  by  substituting  all 
possible  values  that  the  vector  u  can  take  such  that  Ui  =  ±um . 
for  at  least  one  z,  where  i  =  1, . . . ,  n  and  Ui  indicates  the  zth 
control  input. 

Consider  a  2-D  state- space  for  simplicity  of  analysis.  The 
DCA  for  this  2-D  state- space  defines  a  rectangular  parallelo¬ 
gram  whose  vertices  are  obtained  from  (11)  when  both  compo- 


\xr\  <  b*Um. 
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nents  of  the  control  vector  are  equal  to  any  one  of  the  four  pos¬ 
sible  permutations  of  the  extremal  values  (zb urni ,  ±um2).  The 
edges  of  the  parallelogram  correspond  to  cases  when  only  one 
component  of  the  control  vector  is  equal  to  an  extremal  value. 
The  other  component  of  the  control  vector  can  equal  any  value 
within  the  control  bounds.  Consequently,  the  DC  A  boundary 
can  be  calculated  and  stored.  As  the  state  approaches  the  DCA 
boundary,  the  control  can  be  switched  from  a  tracking  control 
to  a  stabilizing  control.  This  idea  easily  extends  to  n -dimen¬ 
sions,  where  the  DCA  is  an  n-dimensional  parallelepiped.  How¬ 
ever,  this  approach  requires  explicit  identification  and  storage  of 
the  DCA  boundary,  which  can  be  computationally  intensive  for 
higher  dimensional  plants. 

An  alternate  approach  for  determining  the  switching  con¬ 
trol  law  is  to  use  a  scalar  measure  that  keeps  track  of  how 
close  the  operating  point  is  to  the  DCA  boundary,  instead  of 
defining  the  DCA  boundary  explicitly.  This  approach  can  be 
implemented  by  identifying  the  control  component  necessary 
to  cancel  the  inherent  plant  dynamics,  ucancei,  which  can  be  ob¬ 
tained  by  solving  the  following  equation  at  run-time: 

Ucancei  =  (12) 

The  applied  control  can  be  switched  from  tracking  to  stability 
as  ucancei  approaches  uextremah  which  occurs  when  at  least  one 
component  of  ucancei  is  equal  to  um.  Since  (12)  is  simple  to 
solve  can  be  solved  at  every  time  step,  this  approach  eliminates 
the  need  for  prior  explicit  identification  of  the  DCA. 

B.  Direction  Consistent  Control  Constraint  Mechanism 

For  a  multi-input  plant  the  bounded  control  not  only  restricts 
the  magnitude  of  the  applied  control,  but  also  changes  the  di¬ 
rection  of  the  system.  Fig.  2  illustrates  this  concept.  Consider 
a  scenario  with  two  controls  ui  and  U2.  Assume  that  the  con¬ 
trol  calculated  by  the  control  algorithm  to  track  a  desired  refer¬ 
ence  uc  is  greater  than  the  control  bounds  shown  by  the  box.  If 
each  control  is  saturated  to  its  respective  maximum,  the  control 
applied  to  the  plant,  ua,  has  a  significantly  different  direction 
compared  to  uc.  When  this  control  is  applied  to  the  plant  the 
resulting  rate  of  change  of  state  also  has  a  different  direction 
than  the  desired  direction.  Here  we  develop  a  control  strategy 
that  implements  ua,  direction  consistent  shown  in  Fig.  2  that  is 
within  the  position  limits,  which  not  only  maintains  the  same 
direction  as  uc,  but  also  attempts  to  preserve  the  direction  of 
the  resultant  rate  of  change  of  the  state  so  that  the  direction  of 
the  resultant  rate  of  change  of  the  state  is  the  same  as  that  of  the 
desired  rate. 

Consider  Fig.  3  in  which  the  plant  is  of  the  form  x  =  Ax+B u 
and  the  desired  control  required  to  track  the  reference  trajectory, 
Udesired*  is  calculated.  If  Udesired  is  outside  the  control  bounds, 
the  saturated  version  of  the  control  usat  is  applied.  In  Fig.  3(a), 
each  component  of  Udesired  is  saturated  to  its  respective  max¬ 
imum  value.  Consequently,  usat  has  a  different  direction  com¬ 
pared  to  Udesired ,  and  the  resultant  direction  of  xsat  is  different 
from  Xdesired-  In  Fig.  3(b),  the  saturation  is  enforced  in  such  a 
way  that  the  direction  of  usat  is  the  same  as  that  of  Udesired- 


m 

ua-  direction 
consisted!.  ^ 

uc 

)£< 

111 

Fig.  2.  Direction  consistent  control  constraint  mechanism. 

However,  the  preservation  of  the  control  direction  does  not  en¬ 
sure  that  the  resultant  direction  of  xsat  is  the  same  as  that  of 

^desired  • 

The  control  is  calculated  in  two  parts.  The  control  necessary 
to  cancel  the  inherent  plant  dynamics  is  calculated  first,  and  then 
the  control  which  produces  a  rate  of  change  of  the  state  in  the 
desired  direction  is  calculated.  Referring  to  Fig.  3(c),  the  first 
component  of  the  calculated  control  is  equal  to  —  Ax,  and  the 
second  part  is  equal  to  Xdesired  •  The  first  part  of  the  calculated 
control  will  be  within  the  control  position  limits  since  the  plant 
state  is  restricted  within  the  DCA.  The  second  part  of  the  con¬ 
trol  is  subjected  to  direction  consistent  control  saturation,  which 
preserves  the  direction  of  the  control  vector.  Therefore,  the  satu¬ 
rated  version  of  the  second  part  of  the  control  equals  xsat ,  which 
also  ensures  that  the  direction  of  the  resultant  rate  of  change  of 
the  state  is  the  same  as  the  desired  rate. 

V.  Tracking  Control  Law 
The  plant  model  is  of  the  form  given  by  (1) 

x  =  Ax  +  Bua  (13) 

where  A  and  B  are  known  constant  matrices  of  compatible  di¬ 
mensions  and  the  following  assumptions  hold. 

Assumption  Al :  B  is  non-singular. 

Assumption  A2:  The  initial  condition  x0  E  Un  is  such  that 
v  =  B~1Ax0  satisfies  \vk\  <  umk  for  all  k  =  1,  2, . . . ,  n. 

From  this  condition,  due  to  continuity,  there  always  exists  a 
scalar  A  >  0  such  that  vs  —  B~1(A  +  AI)x0  also  satisfies 
Kfc  |  <  Umk  for  all  k  =  1,  2, . . ,  y  n. 

Assumption  A3:  We  assume  here  that  A  >  0  is  chosen  such 
that  K|  <  umk  holds. 

The  tracking  error  is  defined  as 

e(t)  =  x(t)  —  xr(£).  (14) 

Differentiating  the  tracking  error  with  respect  to  time,  and  sub¬ 
stituting  (13)  into  (14), 

e  =  Ax  +  Bua  —  xr.  (15) 

Adding  and  subtracting  Ae  to  (15),  the  equation  for  the  error 
dynamics  becomes 

e  =  — Ae  +  (A  +  A  J)x  +  Bua  —  (xr  +  Axr).  (16) 
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Fig.  3.  Rate  of  change  of  the  state  due  to  various  control  saturation  strategies, 
(a)  Each  component  saturated  to  maximum,  (b)  Direction  consistent  control  pre¬ 
serving  the  direction  of  the  control  vector,  (c)  Direction  consistent  control  pre¬ 
serving  the  direction  of  the  resultant  rate  of  change  of  the  state. 


If 


U  a  =  -B  1(A  +  XI)x  +  B  1(xr  +  Axr)  (17) 

we  have  e  =  —  Ae.  The  parameter  A  thus  governs  the  closed- 
loop  dynamic  system  behaviour  and  needs  to  be  chosen  appro¬ 
priately  such  that  assumptions  A2  and  A3  are  satisfied.  We  now 
state  the  following  definitions: 

ut  =  —  B~1(A  +  A/)x  +  B~1(xr  T  Axr)  (18) 


and 


u  s  = —B  1(A  +  A/)x.  (19) 


Note  that  (18),  referred  in  this  work  as  the  tracking  control,  has 
two  components.  The  first  term  —  B~1(A  +  A J)x  cancels  the 
inherent  plant  dynamics.  The  second  term  5_1(xr  +  Axr)  pre¬ 
scribes  the  desired  dynamics  necessary  to  track  the  reference 
trajectory.  The  first  component  is  used  as  a  measure  of  how  close 
the  state  is  to  the  DCA  boundary  and  will  be  referred  to  as  the 
stability  control  us  defined  by  (19). 

Further,  define  for  each  kth  control  input 

upk  =  (20) 


where  k  =  1,2,3,.. .,n.  Then,  the  maximum  of  this  ratio  of 
stability  control  to  the  bound  on  each  kth  actuator  is  defined  as 


=  max  ( Tip 


.)• 


(21) 


The  quantity  urrip  is  used  as  a  measure  of  how  close  the  current 
state  is  to  the  DCA  boundary.  Given  the  definition  of  \xrrip  by 
virtue  of  (A3),  we  are  ensured  of  urrip  (0)  <  1 .  It  should  be  noted 
that  a  value  of  urUp  =  1  indicates  that  the  state  is  at  the  DCA 
boundary. 

For  a  point  which  is  inside  the  DCA,  but  approaching  the 
boundary,  the  magnitude  of  wrrip  keeps  increasing  from  0  and 
reaches  1  on  the  boundary.  To  avoid  saturation,  choose  scalar 
parameters  s i,  52,  pi,  and  p2  all  close  to  1,  satisfying  0  < 
Ump  (0)  <  si  <  52  <  pi  <  P2  <  1  that  decide  the  switch 
between  tracking  and  stability  control.  Whenever,  ump  >  s2, 
the  tracking  control  is  close  to  being  subjected  to  saturation.  In 
this  case,  the  concept  of  direction  consistent  control  mechanism 
is  implemented  and  the  saturated  version  of  the  tracking  control 
law  is  employed,  given  by  the  following  equation: 

utsat  =  -B~1(A+XI)x+(l-ump)Satdc  {.B-1(xr  +  Axr)}. 

(22) 

The  direction  consistent  control  saturation  function  Satdc  main¬ 
tains  the  direction  of  the  resultant  control  to  be  the  same  as  the 
direction  of  the  desired  control  in  spite  of  control  saturation.  The 
saturation  function  is  defined  in  Section  V-A. 

A.  Saturation  Function  Definitions 

For  any  Udes  £  lRn,  we  adopt  a  hard  saturation 

<S(udeSk)  =  rnin(umfc,  |udeSk  |)sgn(udeSk)  (23) 
,  n  or  alternately,  a  “soft”  saturation  of  the 


for  all  k  =  1,2 
form 


<S(udeSk)  =  tanh 


/Aide 


Un 


(24) 


for  all  &  =  1,  2, . . . ,  n,  where  (3  is  any  positive  scalar  param¬ 
eter.  Qualitatively,  (3  — >  oc  implies  that  the  “soft”  saturation 
approaches  the  “hard”  saturation  definition  stated  earlier.  If  each 
of  the  controls  is  allowed  to  saturate  independently  to  the  max¬ 
imum  allowed  value,  the  applied  control  is 

Uappk  =  <S(udesJ,  for  all  fc  =  l,2,  ...,n  (25) 

where  S  is  the  saturation  function.  If  direction  consistency  is 
to  be  maintained,  the  proportion  pk  to  which  each  control  is 
saturated  is  calculated  by 

^(^desk  ) 


qk  = 


^desk 


for  all  k  =  1,  2, . . . ,  n. 


(26) 
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The  minimum  saturation  proportion  qm-in  is  identified 

(/min  =  min(gi,  q2,  . . . , gn)  (27) 

and  all  controls  are  saturated  with  the  same  proportion.  Note, 
by  construction  qk  <  1  for  all  k  and  accordingly,  qm-in  <  1  also 
holds.  As  a  result,  the  direction  of  the  applied  control  vector  is 
the  same  as  the  calculated  control  vector  Udes  •  Therefore 


which  is  a  contradiction.  Thus,  \xrrip  (t)  <  p2  for  all  time  t  >  0. 
Finally,  note  that  the  closed-loop  tracking  error  system  is  given 
by 

e  =  — Ae  +  B6  (36) 

where  6  is  a  bounded  signal  of  time  whose  explicit  characteri¬ 
zation  is 


Satdc(Udes)  —  ^min^de 


B.  Complete  Control  Law 

Continuing  with  the  discussion  of  ump ,  wheneve 
is  greater  than  p2  the  stability  control  is  employed.  The  other 
design  parameters  s  i  and  p\  are  used  in  order  to  produce  smooth 
transitions  between  controls;  from  ut  and  utsat ,  and  from  utsat 
to  us .  This  transition  can  be  accomplished  by  a  linear  or  higher- 
order  interpolation  as  ump  takes  values  from  s i  to  52  and  p\  to 
p2 .  Finally,  the  applied  control  ua  is  defined  as 


(28) 

°, 

if  luU  1 

<  Umfc 

and 

Ufc  - 

Ut, 

if  si  < 

Ump 

< 

<§2 

8=  < 

^Lsat 

- 

if  s2  < 

Ump 

< 

Pl 

Uc  - 

Ut, 

VI 

Ump 

< 

P2 

value 

,  Us  - 

Ut, 

if  ump 

>  P2 

<  si 


or 


8=0,  if|utJ  <  u 
6  =  (3£2  -  2£3)((1  - 
-  (3£2  -  2p)B 


mk 

U 


(37) 


anduTOp  <  si 

,)Satdc(S_1(xr  +  Axr))) 
1(xr  +  Axr), 


(38) 


ua  =  ut  if  |utfc  I  <  umfe  and  uTOp  <  siV  k  -  1,  2, . . 
else, 

'  Ufe,  if  Si  <  ump  <  s2 

u  =  J  utaat  *  if  s2  <  Ump  <  Pl 

a  J  Uc,  if  Pl  <  Ump  <  p2 

,us,  ifump>p2 


n 


(29) 


where 


Ufe  =/(ut,utsat,si,s2,ump)  (30) 

Uc  =/(utsat,us,pi,p2,ump)  (31) 

have  been  introduced  for  smooth  transitions  between  controls. 
The  function  /  is  a  third- order  interpolation  scheme  defined  as 

/(ui,u2,ci,c2,ump)  =Ui  +  (3£2  -  2£3)(u2  -  Ui) 

Um  —  Cl 

£  =  ^ - -.  (32) 

c2  -  ci 


if  si  <  um  <  s2,  where  £  =  — — —  (39) 

«2  -  Si 

8  =  (1  -  ump)Satdc(.B_1(xr  +  Axr)) 

-  (B_1(xr  +  Axr)), 

if  s2  <  ump  <  pi  (40) 

8  =  (1  -  ump ) Satdc (-B _  1  (xr  +  Axr)) 

-  (B-\±r  +  Axr)) 

+  (3£2  -  2£3)  ((1  -  umJSatdc(B_1(xr  +  Axr)))  , 

if  Pi  <  um  <  p2,  where  £  =  — — El  (41) 

P2-P1 

8  =  -  (S_1(xr  +  Axr))  if  ump  >  p2.  (42) 

Due  to  boundedness  for  6,  we  have  e  E  C ^  ensuring  bounded¬ 
ness  for  all  closed-loop  signals. 

VI.  Adaptive  Case 


Next,  it  is  shown  that  selecting  p2  such  that  urrip  (0)  <  P2  <  1 
ensures  that  urUp(t)  <  p2V  t.  This  can  be  proved  as  follows. 
Let  urUp  <  p2  for  some  t  =  t\  >  0  and  let  utk  <  umk  and 
ump  <  5i,  then  the  applied  control  ua  =  ut.  Then  for  t  >  t\, 
x(t)  =  e-A(t-ti)^(^i)  +  function(:rr,  xr,  A)  and  urrip  (t)  = 
e(-A(t-ti))ump(ti)+function(arr,i;rj  A,  A,  B).Ifump(t)  > 
s  i  the  control  smoothly  switches  to  115 .  The  control  continues  to 
switch  from  there  on  depending  on  the  value  of  parameter  urrip . 
Suppose  there  exists  a  finite  time  t*  >  0  such  that  \xrrip  (t* )  =  p2 
andu mp(t)  >  P2,Vt  >  U.  Then,  ua(£)  =  u s(t)  =  -B~1(A-\- 
A I)x(t)  V  t  >  t*.  This  would  yield 


Before  proceeding  to  the  control  law  for  the  case  of  uncertain 
parameters,  the  following  assumption  is  made. 

Assumption  Bl:  Both  B  and  B  are  non-singular.  Further, 
suppose  there  exists  a  symmetric  matrix  L  that  is  either  posi¬ 
tive  or  negative  definite  such  that  LB  =  B,  or 

L  =  [I  +  ABB-1}-1.  (43) 

Assume  also  that  the  function  sgn(L)  is  known  and  defined  such 
that  sgn(L)  =  1  when  L  is  positive  definite,  and  sgn(L)  =  —  1 
when  L  is  negative  definite.  Additionally  define  matrix  P, 


x(i)  =  x(f*)e“A(t“t*)  V  t  >  t*.  (33) 


P  =  Lsgn(L)  (44) 


Using  the  definition  of  \xrrip  (t)  we  have  for  all  time 

U mp(t)  =e~x^t~t  )Ump(t*)  (34) 

_e_A(t-t*)p2  <  p2  (35) 


and  let  pm  =  Amax[P]  and  prn  =  AminfP]  be  the  maximum 
A  B  A  B 

and  minimum  eigenvalues  of  P,  respectively. 

_  Define  6  =  B~1(A  +  XI),  0  =  B~1(A  +  XI),  </>  =  B~\ and 
(j)  =  B~x ,  where  A  is  the  specified  closed-loop  eigenvalue  as 
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defined  in  earlier  section.  Thus  0  =  (B  +  A B)  1 ,  and  through 
the  application  of  the  matrix-inversion  lemma 

0  =  B-1  -  (B  +  A#)-1  ABB-1  (45) 

=  0  +  A0  (46) 

where  A0  =  —  (5  +  AT?)-1  AI?F?_1.  Now,  given  B  and  the 
fact  that  the  uncertainty  A B  satisfies  Bij  =  Bij  +  A Bij  from 
(3),  we  denote 


Thus,  we  have 


A 0u/max  =  max  [A(f>ij] 

A 4>ijmin  =  min  [A0^]  . 
ZsB 


4>ij  -  A <?Umin,  (frij  +  A 4>i 


<t>f  =<t>£  R"x" 

such  that  <f>  ij  e 


u IJ 

—  R-l 


4ij  +  A^mm.^i  +  A^max 


Similarly,  for  0  =  B  1(A  +  A/),  we  can  obtain 

^  =  [S_1(4  +  A/  +  A^)] 

-  (£?  +  AS)”1  ABB~\A  +  XI  +  AA) 
=  B~1(A  +  XI)  +  B~l  AA 

-  (B  +  AB)~1ABB~1[A  +  XI  +  AA] 
=  6  + A6 


such  that 


Oij  e 


A6Lmax  =  max  [A6U 

AA,AB 

A9ijmin  =  min  [A0«] 

J  AA.AB  1  J 


0ii  +  A0iimm,0ij-  +  A0ii 


As  before,  the  feasible  set  of  values  for  0  can  be  defined  as 

0/  =  {6*  e  RnXn\9ij  e 


'h-  A^min,%  +  A%max]}. 


Assumption  B2:  We  denote  z>s  =  —  0(O)xo,  where  the  hat 
over  the  variable  denotes  its  estimated  value.  From  (56),  vs^k  — 
— Xk(0)x0  and  for  any  given  x0  G  Un 


\va,k\  =  |Xfc(0)x0| 

<  max  1 1 

Of 

^  Pk  1 1 X0 


<  max  ||xfc||i||xo| 

6f 


(57) 

(58) 

(59) 


with  /ifc  =  max# 


|i.  Assume  that  for  any  x0  G  IRn, 
,  n.  For  some  p2 
selected  such  that  0  <C  P2  <  1 ,  select  the  scalar  a  such  that 


A*fc||x0||  <  ap2umk  for  all  k  =  1,2,. 


(47) 


(48) 


wherein  the  values  of  A(p,3  m,n  and  Ad>,j max  can  be  precom- 
puted  as  well-posed  optimization  problems  represented  via  (47). 
Then  the  feasible  set  of  values  for  0  may  be  defined  as 


_  •  /  Pm  (9  fl  k 


with  0  = 

i—  1  .7  =  1 

muxs  Jlxfclli 


and  /3k  = 


min»/  llXfclli 


(60) 

(61) 

(62) 


(49) 


for  all  A;  =  1,  2, . . . ,  n.  Note  that  this  assumption  is  more  con¬ 
servative  than  non-adaptive  case. 

Following  the  control  law  formulation  laid  out  in  the  previous 
section,  but  this  time  written  in  terms  of  0  and  0 


e  =  — Ae  +  B(ua  +  0x  —  0(xr  +  Axr)). 


(63) 


For  systems  with  unknown  0  and  0  matrices,  the  control  law  is 
defined  as 

ua  =  ut  if  |utfc  |  <  umfc  and  ump  <  siV  k  =  1, . . .  n 
else, 


(50) 


where  A#  =  B~lAA-  (B  +  AB)~l  ABB~l[A  +  A/  +  AA]. 
For  any  given  A,  B,  and  A  >  0  and  for  uncertainties  A  A  and 
A B  subject  to  (2)  and  (3)  we  can  again  predetermine 


ua  =  < 


U6, 

if  si  <  ump 

<  S2 

^■^sat  5 

if  s2  <  ump 

<  Pi 

Uc, 

if  Pi  <  U mp 

<  P2 

Us, 

if  U  m  >  P2 

where 


(51) 


ut  =  —  0x  +  0(xr  +  Axr) 
us  =  —  0x 


(64) 


(65) 

(66) 


(52) 


utSat  =  -  Ok  +  (1  -  ump)Satdc(0(xr  +  Axr))  (67) 

ut  and  uc  are  given  in  (30)  and  (31)  and  6  and  0  are  estimated 
values  of  6  and  0.  It  is  important  to  point  that  in  this  case  the 
definition  of  u™  is  revised  to 


u  mp  =  max|uP)fcJ 
k 


(53) 

Further,  define  column  vector  complements  of  the  matrices  0 
and  6 

<fiv  =  [011,  •  •  •  ,  01n,  021,  •  •  •  ,  02 n,  0nl,  •  •  •  ,  0rm]T  (54) 
@v  —  11?  •  •  •  ,  @lni  @2h  •  •  •  ,  @2m  @nli  •  •  •  ,  @nn\  (55) 

with  each  row  of  0  defined  as 

Xk  =  [0ku  0k2t  •  • . ,  Okn\  fo r  all  k  =  1, . . . ,  n.  (56) 


||Xfc  ||  l  ||x|| 

GU  m,u 


(68) 

(69) 


where  k  =  l,2,3,...,n.  Substituting  for  the  control  law  de¬ 
fined  in  (64),  the  closed-loop  error  dynamics  reduce  to 

e  =  -Ae  -  B(9  -  9)x  +  T?(0  -  0)(xr  +  Axr)  +  B8  (70) 

where 

6  =  ua  -  u,  (71) 
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that  is 


8 

= 

0  if 

1 1  <  UTO(J  andu, 

mp  ^  ^1 

(72) 

8 

= 

(3^2 

- 

2C3)((1  -  umr 

i)Satdc(<^(x 

-r  4~  Axr))) 

-(3^2 

-  2^3)0(xr  + 

Axr) 

if  si 

< 

< 

52,  where  ^  = 

H-rnp  S 1 

(73) 

S2  ~  Si 

8 

= 

(1- 

U mp  )Satdc(</>(xr 

+  Axr))  - 

(0(x 

“I-  Ax, 

:)) 

if  s2 

< 

ump 

< 

Pi 

(74) 

8 

= 

(1- 

^mp  )Sat^c(^(xr 

+  Axr))  - 

(0(x 

r  +  Ax, 

-)) 

+  (3£2 

-  2£3)  ((1  -  1 

^■mp  )Sat^c(0(xr 

+  Axr) 

>) 

if  pi 

< 

ump 

< 

p2 ,  where  £  = 

Ump  -  Pi 

(75) 

P2  “Pi 

8 

= 

-  (<A(> 

:r  +  Axr))  if  u 

mp  ^  P2 

(76) 

or 

8 

=  <5(0,  xr ,  xr 

)  for  all  u 

(77) 

In  terms  of  column  vector  complements  defined  in  (56)-(57), 
rewrite  the  closed-loop  error  dynamics  as 

e  =  -Ae  -  BAx(0v  -  6V)  +  BQ(<j>v  -  <f>v)  +  B6  (78) 


with  terms  Ax  and  Q,  defined  such  that 

A  X(6v-Ov)=0-O)x  (79) 

-  4>v)  =  (</>  -  0)(xr  +  Axr).  (80) 


A.  Adaptive  Laws 

To  satisfy  the  given  bounds  on  parameters  and  avoid  param¬ 
eter  drift,  the  projection  scheme  from  [21]  is  adopted 

9v,k=  r(0„,k)  (8i) 

4>v,k  =r(^,fc)  (82) 

where  T  is  defined  as 

{•,  if  •  €  (•min,  •max) 

•min,  if  •  <  •min  (83) 

^max  if  9  >  0max 

The  adaptive  laws  selected  to  be 

0V  =7A xTBTe  sgn(L)  -  ya1  (0V  -  0V)  if  ua  ^  us 
else  0V  =7A xtBtx  sgn (L)  -  707 (0V  -  0V)  (84) 

and  if  at  some  t  =  t*  >  0,  ump(£*)  =  p2,  0(t *)  /  0(t~),  reset 
0(t)  such  that  $(£*)  =  0{t~). 

4>v  =  -7 ttTBTe  sgn (L)  -  ya2((j)v  ~  (85) 


for  any  07 ,  <r2  >  A/7  and  the  adaptive  gain  7  is  chosen  to  satisfy 


7  >  max 
k 


®  pk 

_PmPWmk_  ' 


(86) 


5.  Stability  Analysis 

To  prove  stability  of  the  control  laws  in  (64)  and  the  adap¬ 
tive  laws  specified  in  (84)-(85),  choose  the  following  Lyapunov 
function  candidate: 


V  =eTPe+  - 


\\ov  -  0V ||2  -  || 0V  -  0V | 


1 

4 — 

7 


W&v  4*v 


(87) 


Details  of  the  proof  that  this  Lyapunov  function  is  non-negative 
are  presented  in  [21].  Now  take  the  time  derivative  of  (87),  and 
noting  that  the  true  parameters  are  constant 


V  =  -  2 A eTPe  -  2eTPBAx(0v  -  0V) 

+  2  eT  PLl(^)v  —  (j)v)  +  2  eT  PB6 

0V  (0V  -  0 v )  -  (0V  -  0V)T(0V  -  0V) 


2 

4 — 

7  L 

2 


7 


&V  (&V  <t>v)  (0 V  $v)  $v) 


.  (88) 


For  the  case  ua  /  us  substitute  for  P  from  (44)  and  the  adaptive 
laws  from  (84)-(85) 


V  =  —  2A eTPe  +  2eTB6  sgn (L)  -  207 (0„  -  ^)T(^  -  0V) 
-  2 (T2(4>v  -  4>v)T(4>v  -  <t>v)-  (89) 


Using  completion  of  squares 
V  <  -  A eTPe  + 


B\\28m2 


A Pm 

2(T2(0v  4>v')  4*v  ) 


207 (0„  -  <9„)T((9V  -  0V) 

(90) 


where  <5m  =  supt  1 1 6(t )  1 1 .  Note  from  (77)  6 (t)  is  a  function  of  0 
that  is  bounded  due  to  the  projection  scheme  adopted,  reference 
trajectory  xr  and  its  derivative  xr  that  are  bounded  by  choice 
and  the  parameter  A  that  is  a  positive  scalar  quantity  chosen  by 
the  designer.  By  virtue  of  these  signals,  it  is  guaranteed  that  6(t) 
is  bounded  for  all  time  and  its  supremum  exists.  Furthermore 

T  <  —  XeTPe  —  ^  {||$„  —  9V\\2  —  \\6V  —  0„||2) 

—  (ll^f  4*v\\  \\4*v  fivW  j 

+  —  (ll^u  -  evf  -  ll^u  -  @v\\2  j 

+  ^  {||$„  -  <?V||2  -  II <t>v  -  4>v ||2  j 
II  B\\^  fa  ^ 

4 - - - 2a  1  (0V  —  0V)T  (0V  —  0y) 

Apm 

-  2 a2(4>v  -  4>v)t{4>v  -  4>v)-  (91) 


Since  a  1  >  A/7  and  a2  >  A/7,  we  have 

'\B\\26m2 


v  <  -  af  + 


“t“  ^  'pdy,k  @v,k  @v.k')(@v,k  @v,k) 

^  k=  1 


\pn 
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2A 


^  @v,k^)(y@v,k  @v,k^) 


A 


k= 1 

2 


H“  ^^(2  (pv,k  (t>v,k  ,k)ij$*v  ,k  4*v,k) 

^  k=  1 

2A  ^  * 

^  fc  =  l 


or 


y  <  -  Ay  + 


\\B\\2Sm2 

Xprn 


T  ^  @v,k)(@v,k  @v,k) 

A  77,2 

^  @ v ,k)(@ v ,k 


fc=l 


A  x 

H-  ^  4*v,k ) 

7 

'  k=  1 


A 


Thus 

y  <  -Ay  + 


/e=i 


I^iiVm2 


A 

+  - 

7 


n 

^  ^  @v,k)  H” 

fc  =  l 


or 


y  <  -Ay  + 


|£||2£m2 

^ Pm 


A 

H — 

7 


E  E  {(0”ax  -  0”in)2  +  (cr  -  cy2} 

i=l  j  =  l 


Finally,  it  can  be  concluded  that 

y  <  -Ay +  e 


(92) 


^  4*v,k )  0v,fc)*  (93) 


(94) 


(95) 


(96) 


for  some  positive  constant  e.  This  ensures  that  the  Lyapunov 
function  y  is  uniformly  bounded  for  the  case  ua  ^  us . 

For  the  case  ua  =  us,  there  is  an  additional  term  in  (89) 

y  =  —  2A eTPe  +  2 eTB6  sgn(L)  —  2ai(6v  —  0V)T {0V  —  0V) 
-  2<72{4>v  ~  ^>v)T(^v  ~  4>v)  +  2xjBAx(9v  -  9V).  (97) 


Now 


2x^BAx(§v  -  9V)  =  2xfB(0  -  9)x 

=  2  xxB(9  —  9)xr  +  2  eT(9  —  9)TBTxr. 

(98) 


Define  p  =  ma xgltg2e0f  \\9i  -  02\\  and  xrm  =  supt  ||xr(i)||. 

Then 

V  <  -  A eTPe  -  2<ti -  9V)T0V  -  6V) 

-  2 ct2(4>v  -  4>v)T(4>v  -  <t>v)  -  a eTPe  +  2p\\B\\x2rrn 
+  2 eTBS  sgn(L)  +  2p||e||a;rm||S||  (99) 


or 


V  <  -  A eTPe  -  2ai(9v  -  6V)T (6V  -  9V) 

-  2 a2(4>v  -  4>v)t{4>v  -  4>v)  -  AeTpm||e||2  +  2p\\B\\x2rm 
+  2||e||||B||[§M+pxrm].  (100) 

Let  <5m  =  8m  +  pXrmi  and  replicate  steps  (91)-(94),  to  get 


v  <  -XV+^^+2p\\B\\xlm 

APm 

,  n  n 

+:EE[(jr-9.f)2  +  (^ 

1  i= i  j= i 


(101) 


Similar  to  (96),  it  can  be  concluded  that 


V  <  -XV  +  e 


(102) 


for  some  other  finite  positive  constant  e.  Therefore,  from  the 
uniform  boundedness  theorem,  one  concludes  that  e  G  £oo, 
0  G  Coo  and  $  G  ,  which  results  in  the  boundedness  of  all 
closed-loop  signals. 


C.  Control  Saturation  Analysis 

The  next  issue  to  be  analyzed  is  whether  the  control  signal 
stays  within  saturation  limits  for  all  time.  Suppose  that  at  some 
time  t  =  U  >  0,  ump(t*)  =  p2  such  that,  ua(£+)  =  us(f+). 
In  this  case 

IIXfc(f*)||:iHx(^)||2  <  a2p22Umk2  (103) 


from  the  definition  of  ump  in  (68). 

For  time  t  >  t*,  suppose  that  ua  =  us ,  then  the  state  evolves 
according  to 


x  =  —  Ax  —  BAX(0V  —  0V). 


(104) 


Consider  the  Lyapunov  function  candidate 


ys  =  xTPx  +  - 


Pv 


II ev  -  9V 


(105) 


Notice  that  the  form  of  (105)  is  similar  to  (87),  and  it  can  be 
verified  that  Vs  is  positive  definite.  Next,  take  the  derivative  of 
Vs  along  the  trajectory  in  (104)  to  get 


14  =  2  xTP 


-Ax  -  BAX[9V 


9V)  H — ( 9/ 

J  7 


9v)9Tv.  (106) 


Substituting  the  adaptive  law  for  0V  in  (84) 


Vs  =  —  2AxtPx  —  2xt  BAx(Ov  —  6v)sgn(L) 

+  ~(9V  -  9V)  7 A xtBtx  sgn(L)  -  7<7i(0„ 

7  L 

14  =  —  2A xT Px  —  2<ji(i9„  —  9V)T (6V  —  9 v)  <  0. 


(107) 

,  1 T 

ev) 

(108) 
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Thus,  one  can  conclude  Vs(t)  <  Vs(t*)  for  t  >  f*.  Further 

pm\\x(t)\\2  <  xT(t)Px(t)  <  Vs(t)  <  Vs(u).  (109) 

But 


V.{t.)  =  xT(U)Px(U)  +  -||^  -  6vf 

7 

Vs(u)  <pm||x(^)||2  +  — . 

7 


(110) 


Combining  (109)  and  (110) 


I|x(i)||2<^||x(i* 

Pm 


+ 


0 

7  Pm 


< 


Pm 


pm  min e  \\xk\\i 


||Xfc(i*)||i2||x(i* 


+ 


0 


7  Pn 


(HD 


The  next  simulation  develops  and  evaluates  control  laws  for 
a  lateral/directional  linear  model  representative  of  the  F-16XL 
aircraft.  This  example  demonstrates  that  the  control  laws  devel¬ 
oped  earlier  are  also  applicable  to  systems  of  the  form 

mx  +  cx  +  &x  =  0.  (117) 

This  example  presents  the  necessary  equations  for  imple¬ 
menting  control  for  kinematic  tracking.  The  motive  of  this 
example  is  to  compare  the  response  of  the  system  with  and 
without  the  switching  control  law.  It  is  demonstrated  that 
without  implementation  of  the  switching  control  law  the 
system  goes  unstable,  while  with  the  switching  mechanism 
the  plant  state  remains  within  bounds  and  consistent  with  the 
reference. 

B.  Second-Order  Unstable  Plant  With  Unknown  Parameters 


From  (103) 


||x(i)H2  <  PM 


1 


Pm  min of  \\xk\\i 
Finally,  from  the  definition  of  u 

|us,fc(i)|2  <  |;v>,V)lul  x( 


a2pW  \\x(U 


0 

7  Pm' 
(112) 


<  maX||xfc||i||x(i)||2.  (113) 


From  (112),  (62),  and  the  choices  of  7  in  (86)  and  a  in  (60) 


This  example  demonstrates  the  concept  of  a  direction  consis¬ 
tent  constraint  mechanism  for  an  unstable  second-order  plant. 
The  control  objective  is  to  restrict  the  states  of  the  system  to 
follow  specified  sinusoidal  trajectories  that  are  out-of-phase. 
The  nominal  plant  used  in  the  simulation  is  specified  as 


A  = 
B  = 


0.2 

0.1 

1 

-2 


0.1 

-0.5 

3" 

1  ' 


(118) 

(119) 


|us,fc(i)|2  <  —(3ka2p22u2mk  +  <^~  (114) 

Pm  'l  Pm 

<PWmk ■  (115> 

Thus,  every  kth  component  of  the  control  signal  stays  inside  the 
bounds 

~P2umk  <  us,fc  <  P2Umk  (116) 

for  all  t  >  t*  such  that  u a(t)  =  u s(t).  Therefore,  it  is  guar¬ 
anteed  that  on  enforcing  the  control  saturation  condition  for  the 
adaptive  case,  the  control  signal  stays  within  the  specified  limits. 

VII.  Numerical  Examples 
A.  Purpose  and  Scope 

Validation  of  the  theoretical  developments  presented  above 
is  demonstrated  in  this  section  through  simulation.  The  exam¬ 
ples  demonstrate  the  direction  consistent  mechanism  for  two 
unstable  systems.  The  first  example  is  a  generic  second-order 
plant.  The  tracking  results  are  studied  for  two  sinusoidal  tra¬ 
jectories  of  different  magnitudes.  The  purpose  of  this  example 
is  to  simulate  the  response  of  the  closed-loop  system  for  two 
cases;  one  with  tracking  control  within  bounds  and  the  other 
with  tracking  control  outside  control  limits.  Whenever  the 
tracking  control  is  within  bounds,  it  is  expected  that  the  system 
demonstrates  perfect  tracking.  Direction  consistency  is  demon¬ 
strated  for  reference  trajectories  that  require  more  control  effort 
than  that  available. 


The  true  plant  matrices  are  randomly  generated  with  an  uncer¬ 
tainty  of  5%  in  each  element  of  A  and  B  for  the  simulation.  The 
control  vector  is  symmetrically  bounded  between  [  ±  1  ±  1  ] T . 

1)  Case  1(a):  The  peak-to-peak  amplitudes  of  reference  for 

both  the  states  is  chosen  to  be  2.  The  frequency  of  oscillation 
for  the  specified  reference  is  1  rad/s  and  the  two  references  are 
out-of-phase.  A  is  chosen  as  0.2.  The  other  design  variables  are 
a  =  0.8856,  s\  =  0.97,  S2  —  0.98,  pi  =  0.985,  andp2  =  0.99, 
with  the  adaptive  gains  chosen  to  be  7  =  1,  —  2,  and 

a 2  =  0.2.  The  initial  conditions  of  the  system  state  is  chosen 
as  x(0)  =  [0.02  0.02] T.  Figs.  4  and  5  present  results  for  this 

case.  The  plant  state  in  this  case  perfectly  follows  the  refer¬ 
ence  and  u mp  always  remains  less  than  unity.  Moreover,  since 
the  tracking  control  computed  using  dynamic-inversion  always 
remains  within  bounds,  the  applied  control  smoothly  follows 
it.  Further,  the  adaptive  parameters  are  seen  to  remain  within 
pre-computed  bounds.  Since  the  input  is  not  sufficiently  rich, 
the  adaptive  parameters  shown  in  Fig.  5  do  not  converge  to  true 
values. 

2)  Case  1(b):  In  this  case  the  peak  amplitudes  of  the  refer¬ 
ence  is  increased  to  observe  direction  consistency.  The  ampli¬ 
tudes  are  chosen  as  xre/i  =  1.5  and  xre/2  =  2  with  same 
frequency  of  oscillations  as  in  the  previous  case.  Figs.  6  and 
7  present  the  results.  The  plant  state  remains  bounded  and  di¬ 
rection  consistent  with  the  desired  reference  when  switching 
control  is  implemented.  With  the  switching  control  law  and 
direction  consistent  mechanism  the  applied  control  smoothly 
switches  from  tracking  control  to  the  saturated  control  and  is 
direction  consistent  with  the  desired  tracking  control,  without 
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(a) 


Time(sec) 

(b) 


Fig.  4.  Case  1(a)  plant  and  control  time  histories,  (a)  Case  1(a):  Plant  State,  (b) 
Case  1(a):  Applied  Control. 


any  control  chattering  as  seen  in  Fig.  6(b).  Also  observe  that 
since  us  remains  within  bounds,  the  ratio  UmV  always  remains 
less  than  1. 

Fig.  7  show  the  update  of  the  adaptive  parameters  0  and  </>. 
The  parameter  projection  successfully  restricts  the  adaptive  pa¬ 
rameters  within  the  parameter  bounds.  The  adaptive  parameters 
do  not  show  any  definite  trend  in  the  update.  The  important  thing 
is  to  note  that  parameter  convergence  to  a  constant  was  demon¬ 
strated  even  in  presence  of  errors  due  to  control  saturation. 

C.  F-16XL  Aircraft 

The  objective  is  to  command  an  aggressive  maneuver  which 
will  saturate  the  controls.  Using  the  F-16XL  (see  Fig.  8),  the 
commanded  maneuver  is  a  bank  angle  doublet  of  =b  60  deg  while 
simultaneously  turning  through  a  heading  angle  of  —20  deg. 
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Fig.  5.  Case  1(a):  adaptive  parameter  time  histories,  (a)  Case  1(a):  Adaptive 
Parameter  0.  (b)  Case  1(a):  Adaptive  Parameter  4>. 


The  control  effectors  used  here  are  aileron  8aa  and  differen¬ 
tial  elevon  8 ea  •  While  rudder  is  available  as  a  control  effector, 
it  is  not  used  here  for  the  manueuver  which  constists  primarily 
of  rolling.  The  F-16XL  linear  model  is  displayed  in  the  Ap¬ 
pendix.  All  states  and  controls  are  perturbations  from  the  steady, 
level,  1-g  trimmed  flight  states  given  in  Table  I.  The  open-loop 
eigenvalues  are  Ai?2  =  —0.321  =b  3.609j,  A3  =  —1.0138,  and 
A4  =  -0.032. 

1)  Controller  Synthesis:  A  reduced-order  linear  model  is 
used  to  develop  the  controller.  For  the  strictly  lateral/directional 


VALASEK  et  al :  ADAPTIVE  DYNAMIC  INVERSION  CONTROL  OF  LINEAR  PLANTS 


929 


0.016 


$  0014 


0.012 


I  0.l5r- 

1 

Li  iou- 


0  50 

Time  (sec) 


0.13 


0  50 

Time(sec) 


-0.012 


CM 

si 

I  -0.014 

1 


-0.016 


Time(sec) 


0  50 

Time(sec) 


(a) 


0.15 

15  0.14 

2 

CL 


0.13 


-0.4 

c\T 

(S 

1  -0.44 

CL 

I 

-0.46 

0  50 

Time(seQ) 


0  50 

Time(seo) 


0.26 


0  50 

Time(sec) 


0.1 3. 


_ 

0  50 

Time(sec) 


(b) 


Fig.  7.  Case  1(b)  adaptive  parameter  time  histories,  (a)  Case  1(b):  Adaptive 
Parameter  6.  (b)  Case  1(b):  Adaptive  Parameter  <j>. 


Fig.  6.  Case  1(b)  state  and  control  time  histories,  (a)  Case  1(b):  Plant  State,  (b) 
Case  1(b):  Computed  Control. 


maneuver  performed  here,  the  longitudinal  dynamics  are  ne¬ 
glected.  The  model  is  cast  into  a  structured  form  as  a  kinematic 
part  and  a  dynamic  part  using  only  the  roll  rate  and  yaw  rate 
states.  However,  the  full  model  is  used  for  simulation. 

Kinematic  part 


‘  ■  -  ■  ■■ 

/M 

:  -  ■;  i  ^wrYilrK 

yt 


4>  =p 

tjj  —  r. 

Dynamic  part 

$AA 

8ea 


(120) 

(121) 


(122) 


where  (j)  is  the  bank  angle,  ^  is  the  heading  angle,  and  p  and  r 
are  the  roll  and  yaw  rates,  respectively.  The  control  effectors  are 
limited  to  maximum  position  limits  of  ±  25  deg.  Uncertainty  in 
the  aircraft  dynamical  model  is  addressed  by  randomly  intro¬ 
ducing  errors  into  the  stability  and  control  derivatives  during 
numerical  simulation.  The  reference  trajectory  is  specified  in 


Fig.  8.  F-16XL  external  physical  characteristics. 


TABLE  I 
Trim  State 


Parameter 

Value 

Mach 

Altitude 

Angle-of-Attack 

Aileron  Saa 

Differential  Elevon  Sea 

0.90 

25,000  ft 

4.61  degrees 

0  degrees 

0  degrees 

terms  of  </>r,  i/y,  pr ,  rr,  pr ,  and  fy.  The  control  law  and  the 
update  laws  for  the  adaptive  parameters  are  developed  in  ac¬ 
cordance  with  the  theory  developed  in  the  earlier  sections.  For 
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Fig.  9.  Case  2(a)  Roll  rate  and  Bank  Angle  for  F-16XL. 


Time(sec) 

Fig.  10.  Case  2(a)  Yaw  rate,  heading  angle,  and  sideslip  angle  for  F-16XL. 


brevity  only  the  equations  required  for  incorporating  the  control 
law  in  the  simulation  are  presented  here.  The  tracking  errors  are 
defined  as 


ei  =  (j)  —  <fir  (123) 

e2  =  V>  ~  ifrr-  (124) 


To  track  the  kinematic  angles,  the  closed-loop  dynamics  are 
specified  as 


ei 

e'2 


+  Kd 


ei 

e2 


=  0. 


(125) 


The  design  parameters  are  X,  Kdi  ai,  a2,  pi,  p2i  and  <s2. 
The  tracking  and  saturated  control  are 


u  t  =  -  0 


us 


+  $$ 


(126) 


+  (1  —  ump)  Satdc(di?)  (127) 


where 


d  = 


Pr  H-  Apr 

rr  +  A  r> 


Kde  i 
Kde2 


(128) 


The  adaptive  laws  for  the  elements  of  6  and  4>  are  given  by 
(84)-(85)  with 


P~Pr 
r  —  vr 


(129) 


and 


(@v 


0v)=(0-0) 


p 

r 


(130) 

(131) 


2)  Results  and  Discussion:  Case  2(a)  No  Switching  Control 
Law  For  this  case,  once  the  control  saturates  maximum  con¬ 
trol  is  applied  until  the  tracking  control  falls  back  into  limits. 
Figs.  9-11  show  that  for  the  first  10  s  the  tracking  control  stays 
within  bounds  and  the  state  is  bounded.  After  10  s,  the  tracking 
control  required  is  large  and  the  plant  state  diverges  away  from 
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Fig.  11.  Case  2(a)  Aileron  and  Differntial  Elevon  for  F-16XL. 


the  reference.  Notice  that  without  the  switching  control  law  the 
system  becomes  unstable. 

Case  2(h)  Switching  Control  Law  In  this  case  the  switching 
control  law  is  implemented.  The  design  constants  are  chosen 
as  7  =  80  which  gives  the  value  of  a  =  0.8582,  —  10, 

<t2  =  10,  si  =  0.9,  s2  =  0.92,  pi  =  0.96,  p2  =  0.98,  A  =  10, 
and  Kd  =  10.  Figs.  12-16  present  the  simulation  results.  At  2 
s  the  aircraft  is  commanded  to  roll  at  112  deg/s  to  an  angle  of 
—60  deg  and  simultaneously  turning  to  a  heading  of  —20  deg  at 
a  yaw  rate  of  4.2  deg/s.  Notice  that  the  tracking  control  required 
to  perform  the  maneuver  is  beyond  the  position  limits  specified, 
so  the  saturated  control  is  applied  since  u^p  =  0.9081.  This 
is  greater  than  the  specified  s  i.  The  effect  of  applying  this  con¬ 
trol  is  that  the  aircraft  performs  the  roll  at  a  reduced  rate  of  80 
deg/s.  At  2.7  s  the  tracking  control  lies  within  limits  and  the 
applied  control  stays  there  afterwards.  The  bank  and  heading 
angles  settle  down  to  their  respective  desired  values  at  10  s. 

After  10  s  the  reference  trajectory  changes  direction  yet  the 
system  responds  accordingly  and  starts  to  track  closely.  The  air¬ 
craft  is  commanded  to  bank  60  deg  at  the  rate  of  1 10  deg/s,  while 
simultaneously  turning  to  3.5  deg  at  a  yaw  rate  of  5.4  deg/s. 
Since  the  tracking  control  is  outside  position  limits,  the  satu¬ 
rated  control  is  applied.  As  before  the  roll  is  performed  at  a  re¬ 
duced  rate  of  67.5  deg/s.  The  bank  and  heading  angles  settle  to 
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Fig.  12.  Case  2(b)  Roll  rate  and  bank  angle  of  F-16XL. 


Fig.  13.  Case  2(b)  Yaw  rate,  heading,  and  sidelslip  for  F-16XL. 
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Fig.  14.  Case  2(b)  Aileron  and  differential  elevon  for  F-16XL. 


their  respective  values  at  20  s,  after  which  the  aircraft  is  com¬ 
manded  to  return  to  wings  level,  i.e.,  a  0  deg  bank.  This  is  com¬ 
manded  at  a  rate  of  55  deg/s  and  the  tracking  control  is  applied 
within  position  limits.  It  is  important  to  note  that  during  this 
30  s  time  span  consistency  with  the  reference  is  preserved,  and 
control  chattering  is  avoided.  Even  though  sideslip  angle  (3  is 
not  directly  controlled,  it  remains  within  bounds  and  well  be¬ 
haved  throughout  the  manuever.  This  is  because  the  system  is 
linear  and  minimum-phase  so  the  internal  dynamics  are  stable. 
Further,  throughout  the  maneuver  is  maintained  less  than 
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Fig.  16.  Case  2(b)  Adaptive  parameters  0  for  F-16XL. 
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Fig.  17.  Case  2(b)  Adaptive  parameter  <t>  for  F-16XL. 


1 .  The  adaptive  parameters  do  not  converge  to  their  true  values 
within  the  duration  of  the  maneuver,  because  the  reference  tra¬ 
jectory  is  not  persistently  exciting.  However,  this  is  immaterial 
as  asymptotic  trajectory  tracking  can  be  achieved  irrespective 
of  parameter  convergence. 
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VIII.  Conclusion 

Based  on  the  stability  proofs  and  the  simulation  results 
presented  in  the  paper,  if  the  control  is  unsaturated  the  tracking 
error  asymptotically  goes  to  zero,  and  all  signals  in  the  control 
scheme  are  bounded.  If  the  control  is  saturated,  the  plant  state 
can  only  be  guaranteed  to  be  bounded  and  direction  consistent 
with  the  desired  reference.  The  switching  control  strategy 
successfully  restricts  the  state  within  the  DC  A.  The  transition 
between  the  tracking  control  and  the  stability  control  is  smooth, 
and  the  applied  control  does  not  show  any  chattering.  The  in¬ 
stability  protection  switching  control  law  can  be  implemented 
without  prior  explicit  identification  and  bookkeeping  of  the 
DC  A  boundary.  The  control  laws  developed  in  this  paper 
are  applicable  for  unstable  controllable  linear  time-invariant 
square  non-singular  systems  with  uncertainty  within  ±5%  of 
the  known  nominal  model. 


Appendix 

The  F-16XL  aircraft  lateral/directional  model 
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All  angular  quantities  are  in  radians. 
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Reinforcement  Learning  for  control  of  dynamical  systems  is  popular  due  to  the  ability 
to  learn  control  policies  without  requiring  a  model  of  the  system  being  controlled.  It 
can  be  difficult  to  learn  ideal  control  policies  because  it  is  common  to  abstract  out  or 
ignore  completely  the  dynamics  of  the  agents  in  the  system.  In  this  paper,  Reinforcement 
Learning-based  algorithms  are  developed  for  learning  agents’  time  dependent  dynamics 
while  also  learning  to  control  them.  Three  algorithms  are  introduced.  Sampled-Data 
Q-learning  is  an  algorithm  that  learns  the  optimal  sample  time  for  controlling  an  agent 
without  a  prior  model.  First-Order  Dynamics  Learning  is  an  algorithm  that  determines  the 
proper  time  constants  for  agents  known  to  have  first-order  dynamics,  while  Second-Order 
Dynamics  Learning  is  an  algorithm  for  learning  natural  frequencies  and  damping  ratios 
of  second-order  systems.  The  algorithms  are  demonstrated  with  numerical  simulation. 
Results  presented  in  this  paper  show  that  the  algorithms  are  able  to  determine  information 
about  the  system  dynamics  without  resorting  to  traditional  system  identification. 

I.  Introduction 

In  recent  years,  Reinforcement  Learning  (RL)  has  been  an  extensively  investigated  area  of  research  in 
the  field  of  Machine  Learning.  It  has  been  a  popular  tool  for  solving  problems  such  as  dynamical  system 
control,  gain  scheduling,  maze  navigation,  and  game  playing.  There  has  been  wide  success  in  many  of 
the  applications,  but  researchers  have  often  encountered  problems  when  casting  the  control  of  dynamical 
systems  as  an  RL  problem.  The  state-to-action  mapping  provided  by  RL  techniques  makes  the  use  for 
control  problems  attractive,  and  this  is  especially  the  case  with  Q-learning  due  to  its  proven  convergence  to 
optimality.  RL  methods  like  Q-learning  are  appealing  because  they  can  achieve  this  mapping  experimentally 
without  the  need  of  a  model.  Although  this  is  indeed  the  case,  implementing  these  in  practice  has  proven 
to  be  very  difficult. 

Studying  the  problems  associated  with  this  implementation  reveals  that  often  the  failure  to  implement  RL 
methods  in  dynamical  systems  are  not  caused  but  the  basic  approach  of  methods  like  Q-learning.  Typically, 
the  problem  is  either  a  failure  in  properly  representing  the  problem  or  inaccurate  function  approximation. 
When  choosing  to  implement  the  popular  Watkins’  Q-learning  in  a  dynamical  system  scenario,  it  is  necessary 
to  realize  that  the  algorithm  does  not  explicitly  account  for  time.  Time  dependency  is  often  either  overlooked 
or  handled  outside  of  the  learning  process,  but  accounting  for  time  in  the  selection  of  actions  (or  control)  is 
needed.  For  instance,  when  handling  sampled-data  systems,  small  changes  to  sample  time  can  cause  drastic 
changes  to  the  stability  of  the  control  policy  determined. 

One  research  area  that  has  recently  received  a  lot  of  attention  has  been  the  control  of  cooperative  multi¬ 
agent  systems  through  the  use  of  Q-learning-based  algorithms.  Learning  to  control  multiple  agents  for  the 
purpose  of  cooperatively  achieving  a  specified  goal  is  an  appealing  research  topic  with  high  complexity.  Some 
research  in  this  area  has  involved  comparing  the  effects  using  Q-learning-based  methods  to  determine  joint 
action  selection  between  different  agents  to  the  learning  of  agent  actions  independently.1  Other  research  has 
investigated  stochastic  game  extensions  to  this  and  treated  the  system  as  non-cooperative  by  having  agents 
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consider  only  themselves  with  no  knowledge  of  the  existence  of  other  agents.2  Systems  of  agents  that  need  to 
coordinate  their  actions  without  knowledge  of  each  other’s  actions  has  been  determined  to  be  an  important 
area  of  research  for  the  general  application  to  systems  of  agents  that  do  not  have  the  ability  to  communicate 
with  one  another.3,4  Even  so,  other  research  has  been  conducted  involving  the  improvement  of  Q-learning 
approaches  for  determining  joint  actions  through  the  use  of  Bayesian  inference  to  estimate  strategies.5 

In  most  of  the  research  scenarios  discussed  above,  multiple  agents  are  simulated  in  games  that  have 
no  dependence  on  time.  Time-dependent  agent  dynamics  cause  a  fundamental  change  to  the  system,  and 
considering  the  control  of  time  dependencies  in  multi-agent  systems  has  received  little  attention.  This  may 
be  due  to  the  fact  that  it  is  difficult  to  learn  control  policies  for  a  single  time-dependent  agent  using  RL 
approaches.  To  address  this,  a  topic  of  research  that  needs  to  be  investigated  is  the  learning  of  an  optimal 
sample  time  for  a  sampled-data  system.  Controlling  real  continuous  systems  generally  requires  computer- 
based  control,  so  sampling  of  the  continuous  system  is  necessary.  Without  considering  the  sample  time  used, 
problems  arise  in  attempting  to  use  a  policy  derived  in  simulation  on  an  actual  hardware  experiment.  For 
this  scenario,  if  a  model  of  the  dynamics  exists  it  is  trivial  to  determine  the  best  sample  time  by  classical 
methods.  However,  here  we  consider  the  case  where  a  model  is  not  available  and  RL  is  being  utilized  for 
its  model-free  approach.  This  requires  some  way  to  determine  the  optimal  sample  time  without  the  use  of  a 
model. 

One  more  area  of  investigation  that  is  needed  for  the  learning  of  multi-agent  system  control  policies  is 
to  learn  some  approximation  of  the  dynamics.  The  learning  process  does  not  explicitly  account  for  agent 
time  dynamics,  so  that  information  is  essentially  abstracted  out  of  learning.  Since  the  main  benefit  of 
RL  approaches  like  Q-learning  is  to  learn  a  control  policy  without  the  need  of  a  model,  the  benefits  of 
determining  a  model  have  been  overlooked.  It  can  be  very  useful  for  some  knowledge  of  the  dynamics  to 
inform  the  decisions  made  by  the  agents,  and  this  is  especially  true  in  heterogeneous  multi- agent  systems. 
A  full  model  may  not  be  necessary,  but  some  approximation  of  the  individual  agent  dynamics  can  be  very 
beneficial  for  determining  global  behavior  rules. 

In  this  paper,  RL-based  control  approaches  are  extended  to  systems  with  unknown  time  dynamics. 
The  scope  of  this  paper  is  limited  to  the  cases  of  simulated  examples  where  the  simulations  have  time 
dynamics  but  the  RL  agent  learning  to  control  the  system  has  no  access  to  that  information.  The  systems 
considered  are  all  sampled-data  systems,  and  they  will  exhibit  first-  or  second-order  dynamics  depending 
upon  the  algorithm  being  investigated.  Three  algorithms  are  introduced.  One  is  capable  of  determining 
the  optimal  sample  time  for  these  sampled-data  systems  using  RL-based  algorithms  while  simultaneously 
learning  the  control  policy,  and  it  is  named  Sampled-Data  Q-learning  (SDQL).  The  second  is  an  RL-based 
algorithm  capable  of  learning  some  approximation  of  agent  dynamics  for  a  first-order  system,  and  it  is  called 
First-Order  Dynamics  Learning  (FODL).  The  final  algorithm  is  similar  to  the  FODL  algorithm  and  is  for 
second-order  systems,  called  Second-Order  Dynamics  Learning  (SODL).  The  end  result  of  this  research  is 
the  ability  to  learn  an  optimal  sample  time  for  agents,  an  approximation  of  agents’  time  dynamics,  and 
individual  agent  control  policies.  This  collective  result  allows  for  the  control  of  heterogeneous  multi- agent 
systems  by  means  of  hierarchical  commands  provided  by  a  high-level  agent  with  all  of  the  knowledge  learned 
by  these  algorithms. 

This  paper  is  organized  as  follows.  In  Section  II,  the  basics  of  Reinforcement  Learning  are  discussed, 
with  an  emphasis  on  Q-learning.  Section  III  introduces  the  Sampled-Data  Q-learning  algorithm  and  in¬ 
cludes  simulation  results  to  demonstrate  it.  The  First-  and  Second- Order  Dynamics  Learning  algorithms  are 
introduced  in  Section  IV  with  accompanying  results,  and  conclusions  and  open  challenges  are  discussed  in 
Section  V. 


II.  Q-learning 

There  are  multiple  classes  of  algorithms  that  fall  within  the  definition  of  Reinforcement  Learning.  Cur¬ 
rently,  the  RL  algorithms  that  are  most  used  in  research  are  Temporal-Difference  (TD)  methods.  TD  methods 
are  actually  a  conceptual  combination  from  two  other  classes  of  algorithms  known  as  Dynamic  Programming 
(DP)  and  Monte  Carlo.6  Like  Monte  Carlo,  TD  methods  use  experience  through  interaction  with  the  system 
to  update  the  quality  of  the  value  function  without  the  need  of  a  model.  Like  DP,  TD  methods  do  not 
have  to  wait  until  the  end  to  improve  the  value  function,  but  rather  update  it  along  the  way.  This  improves 
convergence  time  and  also  makes  TD  methods  usable  in  online  learning.  In  this  section,  the  TD  algorithm 
known  as  Q-learning  will  be  discussed,  along  with  the  limitations  that  lead  to  the  development  of  extended 
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Q-learning-based  algorithms  for  the  use  of  learning  in  dynamical  sampled-data  systems. 

Of  the  various  formulations  of  RL  algorithms,  Watkins’  Q-learning  has  been  the  most  accepted  and 
utilized  algorithm  for  its  proven  convergence  to  the  optimal  action-value  function.6  Q-learning  is  a  TD 
method  that  learns  the  optimal  action- value  function  in  an  off-policy  manner.7  This  means  that  the  policy 
used  during  a  learning  episode  is  not  necessarily  the  same  as  the  one  that  is  updated  at  each  timestep.  A 
similar  implementation  that  uses  on-policy  learning  is  known  as  Sarsa,  and  is  often  used  in  the  same  learning 
situations  as  Q- learning. 8,9  The  Q-learning  algorithm  is  based  upon  an  action- value  update  rule  that  uses  a 
greedy  policy  to  determine  a  predicted  value  for  the  state-action  pair  at  the  next  (future)  timestep.10, 11  The 
actual  action  selection  may  not  be  done  using  a  greedy  policy,  and  in  fact  it  is  typically  better  for  optimality 
to  include  some  degree  of  exploration  in  the  policy.12  The  rule  used  for  updating  the  action- value  function 
is  as  follows. 


Q(s,  a)  <—  Q(s,  a)  +  a[r  +  yma xQ(s',  a')  -  Q(s ,  a)}  (1) 

a' 

In  Equation  1,  Q  is  the  action- value  function,  s  is  the  state  at  the  current  timestep,  a  is  the  action 
selected  for  the  current  timestep  using  the  agent’s  policy  (e.g.,  ^-greedy),  a  is  the  step-size  parameter,  r  is 
the  reward  received  from  the  system,  s'  is  the  future  state  (due  to  taking  action  a),  a'  is  the  action  that 
would  be  taken  using  a  greedy  policy  when  in  state  s',  and  7  is  the  weight  for  the  future  value.  The  selection 
of  7  affects  convergence  time,  and  it  is  always  within  the  range  (0,1).  The  value  of  7  can  either  be  kept 
constant  throughout  learning,  or  it  can  be  chosen  to  vary  episodically  so  that  later  learning  episodes  value 
the  future  prediction  differently  than  early  episodes.  The  value  of  a  is  always  in  the  interval  (0,1),  and  can 
either  be  held  constant  or  varied  by  some  user-defined  function.  In  some  cases,  a  is  designed  to  decrease 
within  an  episode  according  to  how  often  the  agent  revisits  the  same  state,  essentially  punishing  the  agent 
for  repeating  itself  unnecessarily. 

The  update  rule  of  Equation  1  is  the  backbone  of  the  famous  Watkins’  Q-learning  algorithm.  It  is  an 
off-policy  TD  algorithm  with  a  user-determined  policy  for  selecting  actions  at  each  timestep,  but  uses  a 
fully-greedy  policy  with  the  action-value  function  when  updating  the  action-value  function.  Rather  than 
utilizing  past  information  to  perform  this  update,  this  algorithm  uses  the  predicted  future  state-action  pair 
chosen  greedily.  The  Watkins’  Q-learning  algorithm  is  displayed  in  Algorithm  1. 

Algorithm  1  Q-learning6 

•  Initialize  Q(s,a)  arbitrarily 

•  Repeat  for  each  episode: 

—  Initialize  s 

—  Repeat  for  each  timestep: 

*  Choose  a  from  s  using  policy  derived  from  Q(s,a)  (e.g.,  ^-Greedy) 

*  Take  action  a,  observe  r,  s' 

*  Q(s,a)  <—  Q(s,a)  +  a  [  r  +  7  maxa/  Q(s',a')  -  Q(s,a)  } 

*  s  <—  s' 

—  Until  s  is  terminal 


The  policy  for  action  selection  has  a  drastic  effect  on  the  convergence  of  the  Q-learning  algorithm. 
Some  balance  of  exploration  and  exploitation  is  needed  to  properly  learn  the  full  state-space  and  guarantee 
convergence  to  the  optimal  policy.  An  ^-greedy  policy  uses  a  user-defined  probability,  5,  that  determines 
whether  to  choose  actions  randomly  or  according  to  the  action- value  function  each  time  a  new  action  must 
be  taken.  This  speeds  up  the  convergence  time  by  reinforcing  paths  that  have  already  been  designated  either 
good  or  bad  while  still  allowing  for  new  paths  to  be  explored  for  optimality.13  In  the  early  episodes  the 
agent  is  required  to  explore  due  to  lack  of  knowledge,  regardless  of  the  value  assigned  to  5.  This  6-greedy 
policy  is  used  as  the  action  selection  method  for  Q-learning  in  this  paper. 
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III.  Sampled-Data  Q-learning 


When  Algorithm  1  is  used  in  dynamical  systems,  the  problem  of  handling  time-dependencies  arises. 
Because  it  is  much  faster  (and  safer)  to  perform  the  learning  in  a  computer  simulation  rather  than  online 
with  a  hardware  system,  researchers  often  turn  to  using  dynamics  models.  The  issue  of  ensuring  an  accurate 
model  is  obvious,  but  it  is  often  overlooked  that  using  the  final  control  policy  on  a  hardware  model  results 
in  a  sampled-data  system.  The  control  policy  is  handled  using  a  discrete  computer,  but  the  policy  is  learned 
assuming  continuous  dynamics.  In  some  cases,  the  user  realizes  this  and  assumes  a  sampled-data  system  in 
the  simulated  learning,  but  the  Q-learning  algorithm  will  converge  to  a  policy  that  assumes  the  same  sample 
time  will  always  be  used  and  that  the  chosen  sample  time  is  best.  Here,  an  attempt  to  overcome  this  issue 
is  addressed  by  wrapping  the  sample  time  into  the  learning  process. 

A.  Sampled-Data  Q-learning  Algorithm 

Incorporating  the  sample  time,  denoted  T,  into  the  learning  process  requires  determining  the  value  of  in¬ 
dividual  sample  times  without  adversely  affecting  the  stability  of  the  system  during  an  episode.  It  would 
therefore  be  wise  to  not  allow  a  sample  time  to  change  during  a  single  episode.  However,  to  determine 
optimal  action- value  functions  for  a  range  of  sample  times  requires  incorporating  it  into  the  state-space.  It  is 
therefore  necessary  to  append  the  state-space  with  T  while  not  allowing  it  to  be  affected  by  the  action-space. 

This  gives  rise  to  the  question  of  how  a  particular  sample  time  is  to  be  selected  when  it  cannot  be 
affected  by  the  action-space.  It  is  necessary  that  a  value  be  associated  with  each  possible  selection  of  T, 
but  T  must  be  held  constant  throughout  an  episode.  It  is  therefore  proposed  that  a  state-value  function  for 
T  be  determined  using  Monte  Carlo-based  learning.6  The  value  function  can  be  updated  according  to  an 
every- visit  Monte  Carlo  method,  shown  in  Equation  2. 

Vt(T)  <-  Vt(T)  +  a{R  -  Vt(T))  (2) 

This  update  rule  will  be  the  basis  for  Sampled-Data  Q-learning.  At  the  beginning  of  each  episode,  the 
sample  time  value- function,  Vr,  can  be  used  to  select  T  for  the  episode  according  to  a  user-defined  policy. 
The  sample  time  is  appended  to  the  system  state  vector,  s,  so  that  the  normal  Q-learning  update  rule  will 
determine  separate  control  policies  according  to  different  values  of  T.  When  the  reward  for  a  given  timestep 
is  determined,  it  is  used  to  update  both  Q,  and  the  total  rewards  for  the  episode  are  updated.  At  the  end 
of  the  episode,  Vr  is  updated  by  using  the  average  return,  R.  This  causes  Vr  to  be  updated  such  that 
episodes  which  experience  more  positive  rewards  than  negative  rewards  result  in  the  value  associated  with 
that  particular  T  increasing.  Likewise,  when  an  episode  experiences  more  negative  rewards  than  positive, 
the  total  value  for  T  after  the  episode  ends  will  have  decreased.  Using  this  new  sampled-data  value  function 
and  update  rule,  the  Sampled-Data  Q-learning  algorithm  becomes  as  shown  in  Algorithm  2. 

B.  SDQL  Results 

For  the  system  described  above,  a  single  agent  was  simulated  as  a  robot  that  translates  forward  with  a 
constant  speed  and  rotates  the  heading  angle  ^  to  change  direction.  The  rotational  dynamics  are  described 
as  a  first-order  differential  equation  with  time  constant  r.  The  environment  the  robot  is  allowed  to  traverse 
is  a  20m  by  20m  square  with  the  origin  at  the  center.  The  governing  equations  of  motion  are  shown  in 
Equations  3-5,  where  the  subscript  c  denotes  the  commanded  value. 


x  =  V  cos  'ip 

(3) 

y  =  V  sin^ 

(4) 

ip  =  (V’c  -  VA_1 

(5) 

At  each  timestep,  the  learner  evaluates  the  current  state  and  chooses  the  appropriate  action  based  on 
an  ^-greedy  policy  in  a  Q-learning  scheme.  The  possible  actions  are  rotate  clockwise  (— A^),  no  rotation 
(A 'ip  =  0),  and  rotate  counterclockwise(+A'0).  At  each  T,  the  robot  is  required  to  take  a  new  action,  which 
corresponds  to  a  new  commanded  next  state.  The  commanded  next  action  occurs  in  intervals  of  45  degrees. 
The  action-space  is  shown  in  Equation  6. 
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Algorithm  2  Sampled-Data  Q-learning  (SDQL) 

•  Initialize  Q(s,a)  arbitrarily 

•  Initialize  Vt(T)  arbitrarily 

•  Repeat  for  each  episode: 

—  Choose  T  using  policy  derived  from  Vt(T)  (e.g.,  ^-Greedy) 

—  Initialize  R  =  0 

—  Initialize  s,  initialize  s  by  appending  T  to  s 
—  Repeat  for  each  sample  timestep,  T: 

*  Choose  a  from  s  using  policy  derived  from  Q(s,a)  (e.g.,  ^-Greedy) 

*  Take  action  a,  observe  r,  s' 

*  Q(s,a)  <—  Q(s,a)  +  a  [  r  +  7  maxa/  Q(s',a')  -  Q(s,a)  } 

*  s  <—  s' 

—  Until  s  is  terminal 
—  R  =  average  (r) 

-  Vt(T)  <-  Vt(T)  +  a[R-  Vt(T)  } 


a  e  A  =  [  -45°  0°  +45°  ]  (6) 

After  simulating  this  problem  using  the  SDQL  algorithm,  the  result  is  a  determined  best  sample  time 
and  a  control  policy  based  on  that  sample  time.  The  values  for  the  individual  sample  times  after  10,000 
learning  episodes  are  shown  in  Table  1. 


Table  1.  SDQL  Robot  Result 


T(sec) 

VT 

0.01 

29.1 

0.02 

29.9 

0.03 

2.4 

0.04 

42.9 

0.05 

51.4 

0.06 

73.4 

0.07 

2271.8 

0.08 

14.7 

0.09 

29.2 

0.10 

120.8 

The  control  policy  has  the  ability  to  control  the  robot  to  move  from  randomly  initialized  points  to  the 
goal  within  a  tolerance  of  ±lm.  The  sample  time  with  the  maximum  value  determined  by  Vr  of  T  =  0.07 
sec  is  in  this  simulation  of  the  resulting  Q  function.  Figures  1-6  demonstrate  the  ability  to  control  the  robot 
to  the  goal  using  this  learned  function  for  the  determined  sample  time. 

Figures  1-3  show  that  the  robot  is  able  to  guide  itself  to  the  goal  from  an  initial  point  in  quadrant  1.  The 
state  time  history  shown  in  Figure  2  shows  that  the  robot  is  able  to  guide  itself  there  in  under  8  seconds  of 
simulated  real-time.  The  commanded  heading  angle  changes  at  each  T  =  0.07  sec  are  shown  in  Figure  3. 

For  further  demonstration  of  the  ability  to  control  the  robot,  an  initial  condition  beginning  in  quadrant 
3  was  tested.  These  results  are  shown  in  Figures  4-6.  Figure  5  shows  that  the  robot  is  able  to  reach  the  goal 
in  this  case  in  approximately  15  seconds,  and  the  commanded  heading  angle  history  is  shown  in  Figure  6. 
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Figure  2.  State  History  of  Robot  -  Q1  Initial  Condition 


IV.  System  Dynamics  Approximation 

When  using  Q-learning  to  determine  a  control  policy  for  dynamical  systems,  the  benefit  of  not  needing  to 
have  a  model  of  the  dynamics  can  lead  to  neglecting  the  dynamics  completely.  Learning  a  control  policy  for 
these  systems  is  necessary,  but  often  it  is  desired  to  also  determine  some  approximation  of  the  dynamics.  This 
is  especially  important  when  dealing  with  the  control  of  heterogeneous  multi-agent  systems  since  coordinating 
the  agents  requires  knowing  how  the  individual  agents  respond  differently  in  time  to  similar  action  inputs. 

A.  First-Order  Dynamics  Learning 

The  simplest  agent  dynamics  to  represent  are  first-order  dynamics.  In  a  stable  first-order  system,  the 
dynamics  are  described  by  the  differential  equation  shown  in  Equation  7.  Given  the  command  value  sc,  the 
time  constant  r  is  needed  to  fully  describe  this  differential  equation. 


T8  +  s  =  sc  (7) 

The  solution  to  this  ODE  is  shown  in  Equation  8.  This  solution  can  be  used  to  determine  the  next  state, 
but  r  is  required  to  do  so.  This  makes  learning  an  approximate  time  constant  all  that  is  needed  to  determine 
the  time  behavior  of  the  agent.  Given  the  current  state,  8,  the  predicted  next  state,  8*,  can  be  approximated 
using  the  current  guess  of  the  time  constant,  r,  and  the  sample  time  period,  T,  according  to  Equation  8. 

s*  =  se~T/T  +  (1  -  e~T/T)sc  (8) 
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Figure  3.  Command  History  of  Robot  -  Q1  Initial  Condition 


Equation  8  can  be  used  to  approximate  a  state  transition  for  a  single  1-dimensional  state.  Learning  to 
approximate  the  time  constant  requires  determining  a  reward  that  is  a  function  of  the  current  estimate  of  the 
sample  time.  By  estimating  the  next  state  using  Equation  8  and  the  current  estimate  of  the  time  constant, 
the  reward  can  be  shaped  by  the  error  between  the  measured  true  next  state  and  the  estimated  next  state. 
The  reward  function  used  is  shown  in  Equation  9. 

r  =  ||s*  -  s'||2  (9) 

If  more  than  one  state  dimension  is  to  be  approximated,  multiple  time  constants  would  be  needed.  For 
instance,  if  this  method  were  applied  to  a  robot  traversing  a  2-D  space  with  first-order  dynamics  in  forward 
translation  and  rotation,  one  might  want  to  know  the  state  transition  dynamics  of  both  Tforward  and  rrotate 
independently  as  they  would  most  likely  have  different  dynamics.  Time  constants  for  each  state- action  pair 
would  be  determined  based  on  whether  the  robot  were  moving  forward  or  rotating.  To  learn  a  time  constant, 
a  Monte  Carlo  learning  formulation  similar  to  the  SDQL  algorithm  can  be  used.  Algorithm  3  can  shows  how 
this  can  be  accomplished.14 

Algorithm  3  was  used  for  the  same  example  shown  in  Section  III.B.  After  the  10,000  learning  episodes 
completed,  the  FODL  algorithm  was  able  to  successfully  converge  to  the  proper  value  of  the  time  constant 
of  r  =  0.2  sec.  Table  2  shows  the  V\  values  associated  with  each  time  constant,  r. 

Over  the  course  of  the  10,000  learning  episodes,  the  value  associated  with  r  evolved  as  shown  in  Figure  7. 
As  can  be  seen,  the  value  associated  with  r  =  0.2  became  the  maximum  value  early  in  the  learning  process. 
As  learning  episodes  continue,  the  value  function  reinforces  this  as  the  best  estimate  of  the  time  constant. 
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Figure  5.  State  History  of  Robot  -  Q3  Initial  Condition 


Figure  6.  Command  History  of  Robot  -  Q3  Initial  Condition 


B.  Second-Order  Dynamics  Learning 

In  general,  real  world  systems  are  more  prone  to  exhibit  second-order  behavior  rather  than  first-order. 
Second-order  systems  are  the  simplest  systems  that  exhibit  overshoot  and  oscillations,  so  higher-order  systems 
can  be  approximated  using  second-order  models.15  It  is  therefore  necessary  to  learn  approximate  models 
of  second-order  systems  to  approximate  any  systems  that  are  higher  than  first-order.  The  second-order 
dynamics  can  be  described  by  Equation  10. 


d2s  ds  o  o 

d^  +  2CuJnTt+UJnS  =  UJnSc 


(10) 


To  approximate  a  second-order  system  requires  determining  2  parameters:  natural  frequency  and  damping 
ratio.  The  learning  framework  used  to  determine  these  parameters  is  the  same  as  for  the  first-order  case,  but 
the  value  function  and  the  state  approximation  equations  are  different.  The  value  function,  here  called  V2,  is 
a  function  of  the  frequency  and  damping  ratio.  The  state  prediction  equation  is  the  solution  to  Equation  10 
after  a  time  period  of  T.  This  solution  is  dependent  on  the  current  approximation  of  damping  ratio.  For  the 
case  of  (  =  0,  the  solution  is  as  shown  in  Equation  11. 


§ 

s*  =  sc  H - sin(o;nT)  +  (s  —  sc )  cos(c onT) 

For  the  case  of  0  <  £  <  1,  the  solution  is  as  shown  in  Equation  12. 


(ii) 
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Algorithm  3  First-Order  Dynamics  Learning  (FODL) 

•  Determine  T  and  Q(s,a)  for  system  (e.g.,  Sampled-Data  Q-learning) 

•  Initialize  V\  (s,a,r)  arbitrarily 

•  Repeat  for  each  episode: 

—  Initialize  8,  append  T  to  8 
—  Repeat  for  each  timestep: 

*  Choose  a  from  s  using  greedy  policy  derived  with  Q(s,a) 

*  Choose  r  using  policy  derived  from  V\ (s,a,r)  (e.g.,  ^-Greedy) 

*  Predict  next  state,  8*,  with  8, a,  and  r  using  first-order  approximations 

*  Take  action  a,  observe  actual  next  state,  s' 

*  Observe  r  shaped  from  observed  s'  and  predicted  8* 

*  Vi(s,a,r)  Vi(s,a,r)  +  a  [  r  -  V\  (s,a,r)  ] 

*8^8' 

—  Until  8  is  terminal 


Figure  7.  Time  Constant  Value  History 


S*  =  Sc  +  — J=  =  e  (S-Sc+  sin(wraT  A  -  C2)  (12) 

_l_e-C^„T(s  _  g j  cosKTVl  -  C2) 

For  the  case  of  £  =  1,  the  solution  is  as  shown  in  Equation  13. 

8  8C  T  (s  —  sc)e  u;'n'r  -|-  («s  (jOns  —  ujnsc)Te  UJnr^  (13) 

And  for  the  case  of  (  >  1,  the  solution  is  as  shown  in  Equation  14. 
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Table  2.  FODL  Robot  Value  Function 


r(sec) 

u 

0.1 

-1.79  x  104 

0.2 

-0.65  x  104 

0.3 

-1.12  x  104 

0.4 

-1.44  x  104 

0.5 

-1.49  x  104 

0.6 

-1.68  x  104 

0.7 

-1.75  x  104 

0.8 

-1.86  x  104 

0.9 

-1.94  x  104 

1.0 

-2.03  x  104 

1.1 

-2.05  x  104 

1.2 

-2.03  x  104 

1.3 

-2.07  x  104 

1.4 

-2.26  x  104 

1.5 

-2.22  x  104 

1.6 

-2.14  x  104 

1.7 

-2.11  x  104 

1.8 

-2.27  x  104 

1.9 

-2.19  x  104 

2.0 

-2.19  x  104 

S 


* 


where 


s  +  (2 (uin  -  z{)s-  scz2 
Z2  ~  Z1 

s  +  (2£cjn  +  ^2)5  —  scz\ 

Z<2  -  Z\ 


q-ziT 


0  —  Z2T 


zi  =  wra(C  -  Vc2  - 1) 
Z2  =  w„(C  +  \/C2  -  1) 


(14) 


To  learn  the  second-order  parameters,  the  FODL  algorithm  can  be  adjusted  for  learning  the  natural 
frequency  and  damping  ratio  rather  than  the  time  constant.  This  alternate  version  of  dynamics  learning, 
called  Second-Order  Dynamics  Learning,  is  shown  in  Algorithm  4. 

To  demonstrate  the  SODL  algorithm,  the  robot  example  from  before  was  altered  to  have  second-order 
dynamics  in  the  heading  angle  equation  of  motion.  For  this  example,  the  natural  frequency  of  the  robot 
heading  angle  was  set  to  uon  =  6  rad/sec  and  the  damping  ratio  is  (  =  0.8.  After  10,000  episodes  of  learning 
using  Algorithm  4,  the  value  of  V2  determined  a  maximum  value  associated  with  the  correct  frequency  and 
damping  ratio.  Table  3  shows  the  final  values,  and  they  are  plotted  in  Figure  8. 

The  way  that  the  V2  function  is  formulated  implies  that  the  value  determined  is  for  the  combination  of 
the  variables  u  and  (  rather  than  designating  values  for  each  separately.  This  is  done  because  it  is  both  of 
these  variables  together  that  determines  the  behavior  of  a  system,  and  not  either  one  separately.  However, 
if  one  were  to  want  to  see  how  they  are  valued  individually  then  the  average  values  can  be  determined.  If 
the  value  function  entries  for  each  instance  of  a  particular  uj  are  averaged  together,  an  estimate  of  the  value 
for  that  frequency  is  determined.  Likewise,  the  same  can  be  done  for  the  damping  ratio.  Figures  9-10  show 
how  the  average  values  for  the  individual  parameters  evolve  over  time. 
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Algorithm  4  Second-Order  Dynamics  Learning  (SODL) 

•  Determine  T  and  Q(s,a)  for  system  (e.g.,  Sampled-Data  Q-learning) 

•  Initialize  V2 (s,a,ujnX)  arbitrarily 

•  Repeat  for  each  episode: 

—  Initialize  s ,  append  T  to  8 
—  Repeat  for  each  timestep: 

*  Choose  a  from  s  using  greedy  policy  derived  with  Q(s,a) 

*  Choose  ujn  and  (  using  policy  derived  from  V2 (s,a,ujn£)  (e.g.,  ^-Greedy) 

*  Predict  next  state  s*  with  s,  a,  u;n,  and  (  using  ^-dependent  solution 

*  Take  action  a,  observe  actual  next  state  s' 

*  Observe  r  shaped  from  observed  s'  and  predicted  s* 

*  V2(s,a,un,C)  <-  V2  (s,a,u>n,C)  +  a[r  -  V2(s,a,u;n,()  } 

*  s  <—  s' 

—  Until  s  is  terminal 


Table  3.  V2  after  10,000  Episodes 


uj  —  2 

uj  =  4 

u  =  6 

uj  =  8 

cj  =  10 

II 

O 

4^ 

-3220 

-3107 

-2412 

-2671 

-5130 

II 

O 

bo 

-3450 

-2610 

-1466 

-2606 

-4110 

C  =  1.2 

-3283 

-2883 

-2393 

-2323 

-4092 

C  =  1-6 

-3008 

-2855 

-3180 

-2940 

-3155 

O 

CN 

II 

-3240 

-3602 

-2795 

-2576 

-3484 

V.  Conclusions  and  Open  Challenges 

The  work  presented  in  this  paper  has  shown  that  Reinforcement  Learning-based  techniques  can  be 
adapted  to  not  only  learn  control  policies  for  agents,  but  also  learn  an  approximation  of  the  agents’  dy¬ 
namics.  Several  conclusions  can  be  drawn  from  these  results.  The  Sampled-Data  Q-learning  algorithm  is 
capable  of  determining  longer  sample  times  that  still  allow  for  successful  control  of  the  agent.  The  First-Order 
Dynamics  Learning  algorithm  is  capable  of  determining  the  time  constants  that  best  model  the  dynamics 
of  the  states  for  an  individual  agent  with  first-order  dynamics.  The  Second- Order  Dynamics  Learning  algo¬ 
rithm  is  capable  of  determining  the  best  combination  of  natural  frequency  and  damping  ratio  to  model  the 
second-order  dynamics  of  the  states  for  an  agent. 

There  are  a  number  of  challenges  for  future  research  efforts.  First,  after  learning  the  dynamics  of  agents  it 
can  be  shown  that  in  a  hierarchical  multiagent  system  the  supervisory  agents  can  use  the  dynamics  informa¬ 
tion  to  determine  commands  to  lower- level  agents.  Adaptation  of  these  algorithms  to  stochastic  systems  is 
also  possible.  Another  open  problem  is  determining  a  means  to  either  demonstrate  the  Second-Order  Dynam¬ 
ics  Learning  algorithm’s  capability  to  approximate  higher-order  systems,  or  investigate  alternate  algorithms 
for  approximating  higher-order  dynamics. 
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Figure  8.  V2  after  10,000  Episodes 


Figure  9.  Natural  Frequency  Value  History 
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A  Constructive  Stabilization  Approach  for  Open-Loop  Unstable 

Non-Affine  Systems* 


Anshu  Narang-Siddarth1  and  John  Valasek2 


Abstract — This  paper  focuses  on  the  stabilization  of  non- 
affine-in-control  systems  that  are  open-loop  unstable.  The  main 
result  of  the  paper  is  a  general  method  for  constructing 
feedback  stabilization  of  all  non-affine  systems.  The  synthesis 
procedure  is  based  on  concepts  of  feedback  passivation,  and  is 
extended  for  non-affine  systems  by  deriving  sufficient  conditions 
for  passivity.  The  developments  and  essential  ideas  of  the 
proposed  technique  are  validated  via  simulation. 

I.  INTRODUCTION 

Consider  the  core  problem  of  developing  stabilizing  con¬ 
trollers  for  non-affine  systems  of  the  form 

E  :  x  =  f(x,  u)  (1) 

where  x  E  Mn  is  the  state,  u  G  Mm  is  the  control  input,  and 
f  :  Mn  x  Mm  — >  Rn  is  a  sufficiently  smooth  vector  field. 
Assume  that 

(Al)  The  unforced  dynamics  of  E  in  (1),  namely  x  = 
f(x,  0)  =  fo(x)  is  unstable. 

In  general  the  function  f  (x,  u)  is  not  monotonic  in  the  con¬ 
trol  and  may  be  singular  at  the  origin.  Thus  E  cannot  be 
stabilized  using  either  dynamic-inversion  [1]  or  the  modeling 
error  compensation  technique  [2].  Furthermore,  non-affine 
systems  of  the  form  given  in  (1)  cannot  be  stabilized  using  a 
fixed-gain  static  compensator.  To  see  this  behaviour  consider 
the  system 

x  =  x  —  2  u3.  (2) 

It  is  open-loop  unstable  and  satisfies  Assumption  (Al). 
Suppose  the  control  takes  the  form  u  =  K(t)x  in  (2).  Then 
the  resulting  closed-loop  dynamics  become  x  =  x  —  2 Ksx3 
that  has  the  following  equilibrium  solutions: 

0  for  all  K 

±7=m  for^>°-  (3) 

The  bifurcation  map  (Fig.  1)  illustrates  that  the  origin  re¬ 
mains  unstable  except  for  K  =  oo.  Furthermore,  the  system 
has  three  equilibrium  solutions  for  positive  values  of  the 
feedback  gain  that  converge  to  the  origin  as  K  —>  oo.  This 
behaviour  indicates  that  only  an  infinite  control  effort  can 
stabilize  the  origin. 

*This  work  was  supported  in  part  by  the  U.S.  Air  Force  Office  of  Sci¬ 
entific  Research  under  contract  FA9550-08-1-0038  with  technical  monitor 
Dr.  Fariba  Fahroo. 

1A.  Siddarth  is  a  Post-Doctoral  Research  Associate  in  the  Ve¬ 
hicle  Systems  &  Control  Laboratory,  Aerospace  Engineering  Depart¬ 
ment,  Texas  A&M  University,  College  Station,  TX  77843-3141,  USA 
anshunl@tamu . edu 

2J.  Valasek  is  Professor  and  Director,  Vehicle  Systems  &  Control  Labo¬ 
ratory,  Aerospace  Engineering  Department,  Texas  A&M  University,  College 
Station,  TX  77843-3141,  USA  valasek@tamu  .  edu 


Fig.  1.  Stable  (solid  lines)  and  unstable  (broken  line)  equilibrium  solutions 
of  (2)  with  u  =  Kx 


An  alternative  solution  to  regulate  (2)  is  to  switch  feedback 
gains  in  accordance  with  the  current  state.  As  an  example, 
suppose  the  feedback  gain  for  (2)  was  initialized  to  K(t  = 
0)  =  1.5.  Then  from  (3)  the  state  would  stabilize  to  either 
X steady  =  -0.384  or  xsteady  =  0.384  depending  on  its 
initial  condition.  Next,  if  the  feedback  gain  is  switched  to 
K(t  >  t steady)  =  2  the  state  would  stabilize  at  xsteady  = 
±0.25.  Hence,  increasing  the  gain  successively  the  origin  can 
be  stabilized  through  a  finite  control  input.  The  fundamental 
problem  with  this  switching  scheme  is  that  analytic  deter¬ 
mination  of  the  switching  curves  for  general  systems  of  the 
form  (1)  requires  substantial  system  knowledge  and  offline 
processing.  Additionally,  the  switching  conditions  and  the 
number  of  control  switches  depend  on  the  initial  condition 
and  the  control  form,  leading  to  a  system  specific  design  [3], 

[4]. 

In  this  paper  the  construction  of  an  analytic  state-feedback 
control  law  is  pursued.  The  major  contribution  of  the  paper  is 
a  theoretical  result  which  shows  that  under  mild  conditions 
a  control  law  of  the  form  u(x)  =  a(x)  ±  i/(x)  globally 
stabilizes  a  large  class  of  non-affine  systems.  The  function 
a(x)  converts  an  open-loop  unstable  system  into  a  stable 
system  in  the  Lyapunov  sense,  and  i/(x)  is  constructed  to 
bring  about  the  necessary  energy  dissipation  for  globally 
stabilizing  the  origin.  The  design  procedure  presented  here 
is  based  on  the  ideas  of  feedback  passivation  introduced 
in  [5]  for  control-affine  systems.  The  general  concept  is 
to  use  state-feedback  to  render  the  system  passive  and 
then  employ  well-established  results  for  stabilizing  passive 
systems.  Toward  this  end,  the  fundamental  question  to  be 
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answered  is  when  is  a  general  nonlinear  system  passive? 
The  well-known  Kalman- Yacubovitch-Popov  lemma  [6]  and 
its  nonlinear  counterparts  derived  by  Hill  and  Moylan  [7] 
answer  this  question  for  linear  and  affine-in-control  systems 
respectively.  Sufficient  conditions  for  passivity  of  non-affine 
systems  and  their  relationship  with  the  existing  necessary 
conditions  [8]  are  derived  for  the  first  time  in  this  paper, 
in  Section  II.  The  main  result  for  stabilization  of  general 
multiple-input  systems  posed  as  sufficiency  conditions  is 
derived  in  Section  III.  The  theoretical  results  are  verified 
with  simulation  examples  in  Section  IV.  Closing  remarks 
are  discussed  in  Section  V. 


II.  Passive  Systems 


In  this  section  the  sufficiency  conditions  under  which 
a  nonlinear  system  can  be  considered  passive  are  derived. 
Consider 


x  =  f(x,u) 
y  =  h(x,  u) 


(4) 


with  state- space  X  =  Rn,  set  of  input  values  V  —  Rm,  and 
set  of  output  values  Y  =  Rm.  The  set  U  of  admissible  inputs 
consists  of  all  [/-valued  piece-wise  continuous  functions 
defined  on  R.  The  functions  f(.)  and  h(.)  are  continuously 
differentiable  maps  defined  on  the  open  subset  O  C  Rn. 
It  is  assumed  that  these  vector  fields  are  smooth  mappings, 
with  at  least  one  equilibrium.  Without  loss  of  generality  the 
origin  is  chosen  as  the  equilibrium  of  £i,  that  is,  f(0,  0)  =  0 
and  h(0,  0)  =  0.  In  order  to  derive  conditions  for  £i  to  be 
passive  two  important  definitions  are  reviewed  and  presented 
below. 


Definition  II.1.  The  system  £i  is  said  to  be  passive  if  there 
exists  a  positive  semi-definite  storage  function  U(x)  that 
satisfies  U  (0)  =  0  and  for  any  u  G  U  and  initial  condition 
x0  G  X 

V(x)-V(x0)<  [  yT(s)u(s)ds.  (5) 

Jo 

If  the  storage  function  is  Cr  times  continuously  differentiable 
with  r  >  1  then  (5)  is  equivalent  to 

v  <  yTu.  (6) 


Definition  II.  1  is  the  mathematical  analog  of  stating  that 
the  amount  of  energy  stored  in  a  passive  system  is  less  than 
or  equal  to  the  energy  being  input.  For  convenience  define 
the  vector  fields 


f0(x)  =  f(x,  0)  G  Rn  (7a) 

h0(x)  =  h(x,  0)  G  Rm  (7b) 

where  fo(x)  represents  the  open-loop  dynamics  of  £i  while 
h0(x)  is  the  output  of  £i  at  zero-input.  Using  (7)  and  the  fact 
that  the  vector  fields  in  £i  are  smooth,  (4)  is  equivalently 
represented  as 

X  =  f0(x) +g(x,u)u  (8a) 

h(x,  u)  =  h0(x)  +  j(x,  u)u  (8b) 


where  the  following  identities  have  been  used: 


f(x,u)  -  fo(x)  = 
h(x,  u)  -  h0(x) 


(  (x,7) 

dO) 

\J  o  dy 

7=6>u  J 

(  f 1  9h(x,7) 

dO  ) 

[Jo  d~i 

7=0u  J 

u(x)  =  g ( x .  ii ) ii  (9) 
u(x)  =  j(x,  u)u. 

(10) 


The  vector  fields  g(x,  u)  and  j(x,  u)  defined  above  capture 
the  effect  of  the  control  input  on  the  motion  of  the  dynamical 
system  states  and  the  output.  Notice  that  for  control-affine 
systems  these  vector  fields  will  be  independent  of  the  control 
input  vector.  Using  smoothness  of  the  vector  g(x,  u),  (8a) 
can  be  further  decomposed  as 

m 

x  =  f0(x)  +  go(x)u  +  [Rj(x,  u)u]  (11) 

i= 1 


with  R^(x,  u)  :  Rn  x  Rm  — >  Rnxm,  being  a  smooth  map 
for  1  <  i  <  m  and 

g?(x)  =  gi(x,  0)  =  J^(x,  0)  £  Rra;  1  <i<m  (12a) 

f)  f 

S°(X)  =  5u(X’0)  =  (12b) 

The  vector  field  g?(x)  defines  the  influence  of  the  input  Ui 
on  the  system  about  the  origin  and  is  collected  for  all  inputs 
under  the  vector  go(x). 

The  next  definition  gives  the  necessary  conditions  for 
an  input/output  nonlinear  system  £1  to  be  passive.  In  the 
following  and  the  rest  of  the  paper,  the  expression 

/  0V  \ 

£f°y=  \^,fo(x))  (13) 

represents  the  Lie  derivative  of  the  Cr(r  >  1)  functional 
V  :  Rn  — »  R  along  the  vector  field  fo(x).  Additionally,  the 

standard  notation  adh  q°  is  used  for  Lie  bracket. 

lo-i 

Definition  II.2.  [8].  Let  Ui  =  {x  G  Rn  :  Cf0V(x)  =  0}. 
Necessary  conditions  for  £i  to  be  passive  with  a  C 2  positive 
semi-definite  storage  function  V  are 

(i)  £f0  V  (x)  <  0, 

(ii)  CSoV(x)  =  hj(x)  Vx  G  Ui, 

(iii)  Er=i0(x,°)C^jT(x>O)+j(x.O)  VxeQi. 

where  /^(x,  u)  is  the  ith  component  of  the  vector  function 
f(x,  u). 


For  a  positive-definite  storage  function  property  (i)  is 
analogous  to  Lyapunov’s  condition  V  <  0  for  bounded 
stability.  The  other  conditions  in  Definition  II. 2  follow  di¬ 
rectly  from  Definition  II.  1  by  noticing  that  the  difference 
^f(x,  u)  —  hT(x,  u)u  attains  its  maximum  at  u  =  0  on 
the  set  Ui. 

The  following  theorem  completes  Definition  II.2  by  pre¬ 
senting  the  sufficiency  conditions  required  for  a  system  Li 
to  be  passive. 

Theorem  1.  Let  V  be  a  C1  positive  semi- definite  function.  A 
system  £i  is  passive  if  there  exists  some  functions  q  :  Rn  —> 
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Rk,  W  :  Rra  ->  Rkxm  and  H  :  Rra  x  Rm  ->  Rfexm,  for 
some  integer  k  such  that 

(i)  Cf0V{x)  = -\qT{x)q(x), 

(ii)  CSoV(x)  =  hp  (x)  -  qT(x)W(x), 

(iii)  \  [W(x)  +  H(x,  u)]T  [W(x)  +  ff(x,  u)] 

=  \  [j(x,u)T  +j(x,u)]  -£r(x,u)V, 

(iv)  VFT(x)i7(x,  u)  +  i7T(x,  u)VF(x)  is  positive -definite. 

In  the  conditions  above 

£r(x,u)^  =  [£ri(x,u)V,  •  •  •  ,£Rm(x,u)^]T  e  Mmxm 


Proof.  The  proof  follows  the  developments  given  in  [9]. 
Assuming  functions  g(x),  VF(x),  and  iT(x,  u)  exist,  then 
along  the  solutions  of  Hi 

V  <V  +  |  [W(x)u  +  q(x)JT  [W(x)u  +  <7(x)] 


+  2U 


WT(x)H(x,  u)  +  HT(x,  u)W(x) 


u 


+  -uJ  H 1  (x,  u).ff(x,u)u 


(14) 


through  property  (iv)  of  Theorem  1.  Rearrange  (14)  to  get 
V  <Ct0V  +  £g0Tu  +  ur/:R(X)U)Tu+  ^qT(x)q(x) 

+  i  [gr(x)PT(x)u  +  uTWT{x)q{x)]  (15) 

+  l  [W(x)  +  H (x,  u)f  [PT(x)  +  H(x,  u)] . 


Using  properties  (i)  through  (iii)  given  in  Theorem  1,  (15) 
becomes 


V  <  h^(x)u+  l-uT  [j(x,u)T+j(x,u)]  u 
<  yTu. 


(16) 


Thus,  comparing  (16)  with  (6)  it  is  concluded  that  Si  is 
passive  and  V (x)  is  the  storage  function.  This  completes  the 
proof.  □ 

Remark  1.  Notice  on  the  set  Ui  defined  in  Definition  II.  2 
that  properties  (i)  through  (iii)  of  Theorem  I  become  exactly 
the  necessary  conditions  for  passivity.  Thus,  Theorem  1  plays 
the  role  of  the  generalized  KYP  lemma  for  non-ajfine  systems 
on  the  set  Qi. 

For  an  affine  Si,  Theorem  1  becomes  exactly  the  nonlinear 
version  of  the  KYP  lemma  derived  by  Hill  and  Moylan  [7]. 
This  result  is  summarized  in  the  following  corollary. 

Corollary  1.  Let  V  be  a  C1  positive  semi-definite  function. 
A  system 


X  =  fo(x)  +  go(x)u 
y  =  h0(x)  +  j(x)u 


III.  Control  Synthesis  for  Stabilization 

This  section  addresses  control  design  for  general  non- 
affine  systems.  Suppose  the  control  is  decomposed  as 

u(x)  =  a(x)  +  i/(x)  (17) 

and  the  first  component  a(x)  is  used  to  ensure  the  non-affine 
system  under  consideration  is  passive  through  input  i/(x)  for 
some  dummy  output  y(x).  Then  asymptotic  stabilization  is 
guaranteed  through  pure  output  feedback  under  zero-state 
detectability  conditions  [6].  The  construction  and  proof  that 
the  choice  of  control  in  (17)  asymptotically  stabilizes  a  non- 
affine  system  is  the  focus  of  this  section. 

A.  Control  Synthesis  for  Multi-Input  Non-Ajfine  Systems 

This  result  is  an  algorithm  for  stabilizing  non-affine  sys¬ 
tems  of  the  form 


E:x  =  f(x,u);  x(0)  =  x0  (1) 


with  state-space  X  =  W1  and  set  of  input  values  U  =  Mm. 
The  set  U  of  admissible  inputs  consists  of  all  U -valued  piece- 
wise  continuous  functions  defined  on  M.  The  vector  field 
f(.)  is  a  continuously  differentiable  map  defined  on  the  open 
subset  O  C  Mn.  Without  loss  of  generality,  the  origin  is 
chosen  as  the  equilibrium  of  E.  The  control  algorithm  is 
detailed  in  the  following  four  steps. 

Step  1:  Define  vector  fields  for  the  system  under  study: 


/0(x)  =  f(x,a(x))eR" 

I  /  \\  f1  df  (x,a(x)  +7) 

2(x-‘'(x))  =  X - TR - 

5°=5i(x,0)GMn. 


dO 


(18) 


G  R 


nxrri 


Under  the  influence  of  the  control  in  (17)  and  the  definitions 
above,  E  becomes 


*  =  /0(x)+5(xMx)Mx)-  (19) 

Notice  that  the  vector  field  /  (x)  is  the  closed-loop  dynamics 
with  control  input  a(x),  unlike  /o(x)  defined  in  (7a). 
Further,  /  (x)  is  independent  of  i/(x)  in  (19).  This  allows 
separate  construction  of  a(x)  independent  of  i/(x)  and  is 
exploited  in  the  following  step. 

Step  2:  Construct  a(x)  to  ensure  /Q(x)  is  stable  in  the 
Lyapunov  sense.  This  step  ensures  the  energy  of  the  system 
remains  bounded.  Note  that  during  construction  of  a(x)  the 
control  function  i/(x)  =  0  identically. 

Step  3:  Define  a  dummy  output  h(x,  i/(x))  G  Mm  for  the 
system  in  (19)  to  ensure  that  it  becomes  passive  through 
the  input  iz(x),  or  equivalently,  satisfies  Definition  II.  1  or 
Theorem  1  with  a  positive-definite  storage  function. 

Step  4:  Finally,  construct  the  control  function  i/(x)  to 
satisfy 


is  passive  if  and  only  if 

(i)  CtoV  = -hf  {x)q(x), 

(ii)  £g0  V  =  h |  (x)  -  qT  (x)  W (x), 

(iii)  lWT{x)W{x)  =  \  [j(x)T  +  j(x)]. 


i/(x)  =  — h(x,  i/(x)).  (20) 

From  previous  work  on  stabilization  of  passive  systems  [8] 
it  is  known  that  if  the  dynamics  (19)  along  with  the  output 
definition  in  Step  3  is  zero-state  detectable,  then  the  control 
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given  in  (20)  globally  asymptotically  stabilizes  the  system. 
Zero- state  detectability  states  that  if  the  output  is  identically 
zero,  then  the  state  vector  approaches  the  origin  in  time. 
The  detectability  properties  of  E  can  be  verified  through 
accessibility  type  conditions  summarized  below. 

Definition  III.l.  [8]  Suppose  the  system  E  is  passive  with 
Cr(r  >  1)  storage  function  V,  which  is  positive  definite  and 
proper.  Then,  E  is  zero-state  detectable  ifClCiS  =  {0}  where 
the  distribution 

D  =  span  |adj  g3  :  0  <  k  <  n  —  1, 1  <  i  <  raj 

and  two  sets  Q  and  S,  associated  with  D  be  defined  as 

n  =  jx  e  N  C  Mn  :  C1}  V(x)  =  0,fc  =  1,.  ..,r}  ,  (21) 

S  =  {x  e  N  C  Mn  :  dj^CrVi*)  =  0,  Vr  6  D,  k  =  0, 1, . . . ,  r  -  l} 

(22) 

The  following  proposition  proves  that  E  described  in  (1) 
and  equivalently  in  (19)  is  asymptotically  stabilized  by  the 
proposed  control  algorithm. 

Proposition  1.  Suppose  V  is  a  C 2  positive -definite  Lyapunov 
function  and  the  functions  a(x)  and  i y(x)  satisfy  Steps  1-4 
with  output  h(x,  i/(x))  =  [CgV]T .  If  U  D  5  =  {0}  then  the 
control  u(x)  =  a(x)  +  i/(x)  asymptotically  stabilizes  the 
system  E. 

Proof  Asymptotic  stabilization  is  shown  using  LaSalle’s 
invariant  principle  and  Lyapunov’s  direct  method.  The  rate 
of  change  of  the  Lyapunov  function  about  the  trajectories  of 
E  given  in  (19)  is 

V  =Cf_V  +  CgVv(?f).  (23) 

Then,  through  construction  of  a(x) 

V  <  CgV i/(x).  (24) 

By  Definition  II.  1  and  Theorem  1  (24)  is  passive  with  the 

output  y  =  [CgV^j  .  With  =  {0},  this  passive  sys¬ 

tem  is  zero- state  detectable  and  E  is  asymptotically  stabilized 
by  input  i/(x)  =  —CgV .  This  completes  the  proof.  □ 


three  given  in  (2).  The  control  law  for  this  example  was 
developed  through  analytic  root  solving  techniques  in  [12]. 
Here  an  alternate  control  law  formulation  is  presented  to 
globally  stabilize  the  origin. 

1 )  Controller  Design:  The  feedback  control  of  the  form 
(17)  is  constructed  in  four  steps. 

Step  1:  The  vector  fields  corresponding  to  definitions  (18) 
for  the  system  given  in  (2)  are: 


lQ(x)  =  x  -  2a3  (x)  (25a) 

g  (x,  o(x))  =  —6a2(x)  —  6a(x)u(x)  —  2v2(x)  (25b) 

g°(x)  =  —  6a2  (x).  (25c) 


Step  2:  The  following  choice  for  function  a(x)  ensures 
Lyapunov  stability  of  fQ(x): 


a(x)  = 


Dx )  = 


if  \x\  >  1; 

1 

\/2 

if  —  1  <  x  <  0; 

0 

if  x  =  0  ; 

l 

§/2 

if  0  <  x  <  1. 

ove  the  dynamics  /  (#) 

x  —  x3 

if  \x\  >  1; 

x  +  1 

if  —  1  <  x  <  0; 

0 

if  x  =  0  ; 

x  —  1 

if  0  <  x  <  1. 

(26) 


(27) 


Note  that  fQ(x)  described  in  (27)  has  three  stable  fixed 
points:  x  =  —  1,  x  =  0,  and  x  =  1.  Thus  the  dynamics 
of  the  system  (2)  are  rendered  stable  for  all  time. 

Steps  3  &  4:  These  steps  construct  the  control  input  v(x) 
that  enforces  stability  of  the  origin.  Control  laws  for  such 
a  class  of  system  have  been  addressed  by  passivity-based 
methods.  Following  the  formulation  given  in  [10]  control 
input  v(x)  is  constructed  as 


v(x)  = 


■y(x)£goV(x) 
l  +  \CgOV{x)\2 


(28) 


Proposition  1  is  a  powerful  result  that  guarantees  asymp¬ 
totic  stabilization  for  all  non-affine  nonlinear  systems.  This 
method  of  control  synthesis  is  general  and  relies  upon 
separate  construction  of  stiffness  and  damping  functions 
a(x)  and  iz(x)  respectively.  The  construction  of  i/(x)  has 
received  considerable  attention  in  the  literature  under  the 
label  ‘passivity-based  control’.  The  requirements  of  zero- 
state  detectability  is  a  consequence  of  employing  pure  output 
feedback  for  passive  systems  [8],  [10],  [11]  which  can  be 
relaxed  by  use  of  other  methods  for  control  of  open-loop 
stable  systems. 


where  y(x)  =  1+a 2(1+4^36a2(x))2 ,  and  CgoV{x)  is  the  Lie 

derivative  of  V(x)  along  [0]g°(x)].  The  design  parameter 
0  <  (3  <  1  bounds  the  control  input. 

Proposition  1  guarantees  that  the  control  input  a{x)  + 
v(x)  asymptotically  stabilizes  an  open-loop  unstable  stable 
system  if  D  p|  S'  =  {0}.  A  routine  calculation  shows  that 
CfV{x)  =  0  for  Q  =  {—1,0, 1}.  Additionally, 

0  =  CgoV(x)  =  —  6xa2(x)  (29) 

0  =  %o,£°]^)  (3°) 


IV.  Numerical  Example:  One-Dimensional 
Non-Affine  Unstable  Dynamics 

The  purpose  of  this  section  is  to  verify  the  theoretical  de¬ 
velopments  through  an  open-loop  unstable  non-affine  system. 
The  example  considered  is  a  polynomial  system  of  degree 


is  satisfied  for  x  =  0  so  U  fj  S  =  {0}  for  all  xgM.  Hence  it 
can  be  concluded  that  the  control  form  a(x)  +  v(x)  globally 
asymptotically  stabilizes  the  origin.  Reference  [12]  designed 
u  =  %/x  as  the  control  law  for  the  prescribed  system  using 
inversion,  which  only  locally  regulates  the  system  (2). 
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Fig.  2.  System  response  of  (2)  for  u  =  0  and  u  =  a(x) 
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Fig.  3.  Closed-loop  system  response  of  (2)  and  control  effort 


2)  Results  and  Discussion:  The  proposed  control  law 
given  in  (26)  and  (28)  was  validated  in  simulation  with 
the  design  parameter  /3  set  to  0.9.  The  initial  condition  was 
chosen  as  x(0)  =  1.  The  behaviour  of  the  open-loop  system 
and  the  system  with  control  input  u  =  a(x)  is  presented  in 
Fig.  2.  As  expected  the  open-loop  behaviour  is  unstable  and 
the  system  with  u  =  a{pc)  stays  at  x  =  1  for  all  time. 
The  closed-loop  response  is  shown  in  Fig.  3.  The  initial 
magnitude  of  the  control  input  v{x)  is  small  (specifically 
v(x)  =  0.00029)  but  greater  than  zero  to  ensure  the  state  of 
the  system  becomes  less  than  1.  It  is  difficult  to  see  but  in 
the  figure  at  time  t  =  2 seconds  the  state  is  x{2)  =  0.993. 
The  control  is  dominated  by  a{pc)  since  the  dynamics  fQ(x) 
inherently  push  the  system  toward  the  origin.  By  construction 
in  (28)  the  magnitude  of  v(x)  increases  when  the  state  nears 
the  origin  so  as  to  asymptotically  regulate  the  dynamics.  This 
is  consistent  with  earlier  conclusions  that  high-gain  feedback 
is  required  to  stabilize  the  origin.  Thereafter  the  control  is 
turned  off  and  the  system  stays  at  the  origin  for  all  future 
time.  Note  that  the  discontinuous  nature  of  the  control  is  an 
artifact  of  the  choice  of  a(x). 

V.  Conclusions 

In  this  paper  a  design  procedure  for  analytic  construction 
of  control  laws  for  unstable  non-affine  systems  was  proposed, 
and  sufficiency  conditions  for  passivity  were  derived.  This 


work  also  extended  well-established  control  law  design  pro¬ 
cedures  for  stable  non-affine-in-control  systems  to  unstable 
non-affine  systems,  without  requiring  the  control  influence 
to  be  non- singular  throughout  the  domain  of  interest. 

The  proposed  control  laws  are  real-time  implementable 
and  unlike  some  switching  schemes  proposed  in  literature, 
do  not  require  immense  offline  processing.  The  algorithm  is 
general  and  can  stabilize  systems  of  the  form  x  =  x  —  2 xu4 
by  appropriate  design  of  a(x).  Owing  to  the  energy -based 
concept  that  is  utilized  for  construction  of  the  control, 
the  results  obtained  are  consistent  with  the  physics  of  the 
problem  and  do  not  violate  system  constraints.  Numerical 
examples  illustrate  that  the  nature  of  the  control  function 
a(x)  is  continuous  but  not  differentiable.  It  is  interesting 
to  note  that  [13] [Corollary  5.8.8]  proved  that  any  nonlinear 
system  whose  linear  counterparts  are  unstable  cannot  be 
locally  C1  stabilizable.  Importantly,  the  control  laws  derived 
here  arrive  at  this  well-known  result  without  making  any 
prior  assumptions  about  the  nature  of  the  vector  fields. 
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Necessary  Conditions  for  Feedback  Passivation  of  Nonaffine-in-Control  Systems* 
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Abstract 

It  is  well  understood  that  an  open-loop  Lyapunov  stable 
nonaffine- in-control  nonlinear  system  can  be  asymptotically 
stabilized  through  feedback.  But  stabilizing  an  open- loop 
unstable  nonaffine  system  remains  an  open  research  ques¬ 
tion.  This  paper  derives  the  necessary  conditions  required 
to  render  a  general  open-loop  unstable  nonlinear  system  pas¬ 
sive  through  static  feedback.  It  is  shown  that  this  is  possible 
only  if  the  system  under  consideration  has  relative  degree 
one  and  is  weakly  minimum  phase  through  an  appropriate 
output  definition.  Unlike  feedback  passivation  for  affine-in¬ 
control  nonlinear  systems  this  result  is  not  sufficient.  The 
developments  and  the  essential  ideas  of  the  paper  are  verified 
for  a  continuously  stirred  tank  reactor. 

1  Introduction. 

This  paper  revisits  the  problem  of  stabilizing  systems  of 
the  following  form: 

(1.1)  E  :  x  =  f(x,  u) 

where  x  E  Mn  is  the  state,  u  E  Mm  is  the  control  input 
and  f  :  Mn  x  Mm  -a  Mn  is  sufficiently  smooth.  Through¬ 
out  the  paper  it  is  assumed  that  a  control  Lyapunov 
function  for  (1.1)  exists  which  sufficiently  ensures  that 
the  dynamical  model  given  in  E  is  asymptotically  con¬ 
trollable  [1].  Balakrishnan  [2]  proved  that  any  such  con¬ 
trollable  nonlinear  system  could  be  transformed  into  the 
following  affine  form 

(1.2)  x  =  f(x,  u)  =  fi(x)  +  f2(x)u, 

which  inspired  many  of  the  of  nonlinear  control  tech¬ 
niques  that  we  know  today  such  as  feedback  lineariza¬ 
tion,  gain-scheduling,  sliding-mode  control,  backstep- 
ping  and  more  recently  forwarding.  However,  it  is  dif¬ 
ficult  to  find  a  change  of  coordinates  that  leads  to  the 
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linear  form  given  in  (1.2).  Moreover  if  such  a  trans¬ 
formation  exists,  the  resultant  set  of  coordinates  may 
be  abstract  mathematical  quantities  and/or  lead  to  dis¬ 
continuous  vector  fields  that  are  not  desirable  from  a 
control  standpoint. 

Artstein  [3]  proved  that  a  stabilizable  control  for 
continuous  time- invariant  nonaffine  systems  of  the  form 
(1.1)  exists  if  and  only  if  the  Lyapunov  function  U(x) 
satisfies 

(1.3)  inf  W(x)f(x,  u)  <  0. 

u 

The  intuitive  idea  behind  (1.3)  is  that  there  exists 
some  sort  of  ‘energy’  measure  of  the  states  that  dimin¬ 
ishes  along  suitably  chosen  paths  and  the  control  in¬ 
put  must  be  chosen  to  force  the  system  to  approach 
a  minimal-energy  configuration.  This  condition  is  a 
special  case  of  the  Hamilton- Jacobi-Bellman  equation 
[4][Ch.6,  Sec. 6. 3]  with  time- invariant  objective  function. 
It  is  well-known  that  this  partial  differential  equation 
may  not  always  have  a  solution.  Moreover,  if  a  solution 
exists  it  may  not  be  unique.  This  was  discussed  in  Art- 
stein’s  work  and  he  suggested  that  nonaffine  systems  in 
general  cannot  be  stabilized  with  continuous  feedback. 
Motivated  by  Artstein’s  conclusions,  Jayawardhana  [5] 
used  pulse-width  modulated  control  signals  to  stabilize 
non-interacting  mechanical  systems. 

The  fact  that  discontinuous  control  cannot  be  em¬ 
ployed  for  most  physical  systems  has  motivated  several 
researchers  to  explore  other  feedback  solution  methods 
for  nonaffine  systems.  Moulay  [6]  augmented  convexity 
requirement  on  the  argument  of  (1.3)  to  provide  suffi¬ 
ciency  conditions  for  the  existence  of  stabilizing  continu¬ 
ous  controls  and  developed  constructive  control  laws  for 
a  class  of  single-input  second  and  third  order  polynomial 
systems.  Lin  [7], [8]  explored  passivity-based  methods 
for  smooth  open-loop  Lyapunov  stable  nonaffine  sys¬ 
tems.  The  central  idea  of  the  passivity-based  approach 
is  to  take  advantage  of  the  smoothness  of  the  nonlin¬ 
ear  vector  field  and  represent  it  as  a  linear  combination 
of  affine  and  nonaffine  parts.  Upon  doing  so  the  con¬ 
troller  is  designed  by  assuming  that  the  affine  part  dom¬ 
inates  the  closed-loop  system  stability,  and  the  higher- 
order  terms  are  always  upper-bounded  for  all  admissible 
states  and  control  inputs. 

The  control  design  methods  discussed  so  far  pro- 


vide  constructive  forms  for  the  control  variable.  But 
in  order  to  consider  higher-order  unstable  systems  sev¬ 
eral  approximation  and  numerical  methods  have  been 
explored.  The  intuitive  idea  has  been  to  indirectly  sta¬ 
bilize  the  system  by  varying  the  control  derivative.  In 
order  to  do  so  the  nonaffine  problem  given  in  (1.1)  is 
augmented  with  the  control  input  dynamics  such  that 
the  resulting  dynamics 

(1.4a)  x  =  f(x,  u) 

(1.4b)  ru  =  v 

become  affine  in  the  input  vector  v.  The  time-constant 
r  is  appropriately  chosen  such  that  the  control  input 
dynamics  evolve  faster  than  the  dynamical  system  un¬ 
der  consideration.  Hovakimyan  [9]  designed  the  new 
input  vector  using  dynamic  inversion  motivated  by  the 
observation  that  for  a  single-state  single-input  system 
the  following  input  vector 

(1-5) 

V  =  -sgn  (f(x,u)  &ax) ;  a  >  0,  |7  >  0 

globally  asymptotically  stabilizes  the  system.  But 
(1.5)  assumes  non-singular  control  influence  which  is 
quite  restrictive  and  not  satisfied  in  general.  Similar 
assumptions  were  also  made  in  [10],  [11]  and  [12]. 

Motivated  by  Sontag’s  universal  formula  for  affine 
systems  [13]  one  is  lead  to  the  natural  question:  As¬ 
suming  that  a  control  Lyapunov  function  exists  for  the 
dynamic  system  given  in  (1.1).  Can  a  constructive  con¬ 
trol  law  design  procedure  be  formulated  to  asymptoti¬ 
cally  stabilize  an  unstable  nonaffine  system?  In  [14]  the 
authors  pursued  this  research  problem  and  presented 
a  control  design  procedure  based  on  feedback  passiva¬ 
tion  introduced  in  [15]  for  control- affine  systems.  The 
general  concept  was  to  use  state-feedback  to  render  the 
system  passive  and  then  employ  well-established  results 
for  stabilizing  passive  systems.  Reference  [14]  devel¬ 
oped  sufficiency  conditions  for  passivity  and  presented 
a  novel  method  for  construction  of  control  laws  without 
making  any  assumptions  about  the  nature  of  the  con¬ 
trol  influence.  In  this  paper,  the  important  conditions 
under  which  the  dynamical  model  given  in  (1.1)  can  be 
rendered  passive  through  state-feedback  are  derived. 

The  paper  is  organized  as  follows.  Section  2 
presents  the  necessary  mathematical  preliminaries  and 
conditions  under  which  (1.1)  can  be  rendered  passive 
are  derived  in  Section  3.  A  continuously  stirred  tank 
reactor  example  is  studied  in  Section  4  and  conclusions 
are  presented  in  Section  5. 


2  Preliminaries 

Consider  the  following  nonlinear  dynamical  system: 


(2.6) 


X  =  f(x,  u) 
y  =  h(x,  u) 


with  state-space  X  =  Mn,  set  of  input  values  U  = 
Mm  and  set  of  output  values  Y  =  Mm.  The  set  U 
of  admissible  inputs  consists  of  all  17-valued  piecewise 
continuous  functions  defined  on  M.  The  functions  f(.) 
and  h(.)  are  continuously  differentiable  maps  defined 
on  the  open  subset  O  C  Mn.  It  is  assumed  that  these 
vector  fields  are  smooth  mappings  with  at  least  one 
equilibrium.  Without  loss  of  generality,  the  origin  is 
chosen  as  the  equilibrium  of  Si,  that  is  f(0,0)  =  0  and 
h(0,  0)  =  0. 


Definition  2.1.  A  system  Si  is  said  to  be  passive 
if  there  exists  a  storage  function  V(x)  that  satisfies 
V (0)  =  0  and  for  any  u  G  U,  Vt  >  0  and  initial 
condition  x0  G  X 

(2.7)  V(x{t))  —  V(x0)  <  [  yT (s)u(s)ds. 

Jo 

If  the  storage  function  is  Cr  times  continuously  differen¬ 
tiable  with  r  >  1  then  differentiating  both  sides  of  (2.7) 
gives 

(2.8)  V  <  yTu. 


Definition  2.1  is  the  mathematical  analog  of  stating 
that  the  amount  of  energy  stored  in  a  passive  system 
is  less  than  or  equal  to  the  energy  being  input.  The 
next  definition  gives  the  necessary  conditions  for  an 
input/output  nonlinear  system  Si  to  be  passive.  For 
convenience,  define  the  following  vector  fields 

(2.9a)  fo(x)  =  f(x,  0)  G  Mn, 

(2.9b)  h0(x)  =  h(x,  0)  G  Mm. 

In  the  above  definitions  fo(x)  represents  the  open- loop 
dynamics  of  the  dynamical  system  Si  while  ho(x)  is 
the  output  of  Si  at  zero-input.  Using  these  introduced 
notations  and  the  fact  that  the  vector  fields  in  Si  are 
smooth,  the  nonlinear  dynamical  system  is  equivalently 
represented  as 


(2.10a)  x  =  f0(x)  +  g(x,  u)u 

(2.10b)  h(x,  u)  =  h0(x)  +  j(x,  u)u, 


where  the  following  identities  have  been  used: 

(2.11) 

f(x,  u)  -  f0(x)  =  (  [  ^  do]  u(x)  =  g(x,  u)u 

\Jo  T=0U  J 

(2.12) 


h(x,  u)  -  h0 


<9h(x,  7) 
d'y 


\~y=0u 


u(x)  =  j(x,  u) u. 


Hence  the  vector  fields  g(x,  u)  and  j(x,  u)  capture  the 
effect  of  the  control  input  on  the  motion  of  the  dynam¬ 
ical  system  states  and  the  output.  Recall,  for  control- 
affine  systems  these  vector  fields  are  independent  of  the 
control  input  vector.  Using  smoothness  of  the  vector 
g(x,  u),  (2.10a)  can  be  further  decomposed  as 

m 

(2.13)  x  =  f0(x)  +  g0(x)u  +  [R*(x,  u)u] 

i=  1 

with  Ri(x,  u)  :  Mn  x  Mm  ^  Mnxm,  being  a  smooth  map 
for  1  <  i  <  m  and 

(2.14) 

g!0(x)=g,(x,O)  =  ^(x,O)eRn;  1  <  i  <  m 

(2.15) 

/Of 

g°(x)  =  ^(x,°)  =  [gi(x)>--->gl(x)]  G  Rraxm 

The  vector  field  g?(x)  defines  the  influence  of  input  iq 
on  the  system  about  the  origin  and  is  collected  for  all 
inputs  under  the  vector  go(x). 

For  convenience  let  V  :  Mn  -A  M  be  a  Cr(r  >  1) 
storage  function  and  the  expression 

(2.16)  CfoV=(^M*)) 

represent  the  Lie  derivative  of  the  functional  V  along 
the  vector  field  fo(x). 

Definition  2.2.  [7].  Let  fR  = 

{x  G  Mn  :  £f0U(x)  =  0}.  Necessary  conditions  for 
Si  to  be  passive  with  a  C 2  storage  function  V  are 

(i)  £foF(x)  <  0, 

(ii)  £g0U(x)  =  hg  (x)  Vx  G  fli , 

(Hi)  EIU  Cp-(x>°)-£;  <  2jT(x,  0)  Vx  e  nx, 

where  /^(x,  u)  is  the  ith  component  of  the  vector  func¬ 
tion  f(x,  u). 

If  the  storage  function  was  positive-definite  property  (i) 
would  be  analogous  to  Lyapunov’s  condition  V  <  0 
for  bounded  stability.  The  other  conditions  in  Defi¬ 
nition  2.2  follow  directly  from  Definition  2.1  by  notic¬ 
ing  that  the  difference  (x,  u)  —  hT(x,  u)u  attains  its 
maximum  at  u  =  0  on  the  set  Di. 

3  Feedback  Equivalence  to  a  Passive 
System/Feedback  Passivation 

In  this  section  the  conditions  under  which  the  following 
system 

X  =  f(x,  u) 

y  =  h(x) 


is  feedback  equivalent  to  a  passive  system  with  positive 
definite  storage  function  V (x)  are  derived.  These  condi¬ 
tions  are  developed  to  exploit  the  following  interesting 
stabilizing  property  of  passive  systems.  Assume  that  £2 
is  passive  and  zero-state  observable.  This  means  that 
if  the  output  h(x)  =  0  is  zero,  then  the  state  is  iden¬ 
tically  zero.  With  this  property  the  following  theorem 
states  that  a  passive  system  is  globally  stabilized  purely 
by  output  feedback. 

Theorem  3.1.  [16][theorem  14-4]  If  ^2  is 

(i)  passive  with  a  radially  unbounded  positive  definite 
storage  function  and 

(ii)  zero-state  observable 

then  the  origin  x  =  0  can  be  globally  stabilized  by 
u  =  —  4>(y),  where  <fi  is  any  locally  Lipschitz  function 
such  that  0(0)  =  0  and  yT0(y)  >  0  for  all  y  0  0. 

The  control  in  Theorem  3.1  has  been  formulated  to 
ensure  the  passivity  condition  in  Definition  2.1  holds 
globally.  Then  the  zero-state  observable  property  helps 
conclude  that  the  origin  is  the  largest  invariant  set  and 
hence  the  global  equilibrium  of  the  closed-loop  system. 
In  order  to  use  this  powerful  result  for  control  design, 
conditions  under  which  systems  can  be  made  passive 
need  to  be  studied.  The  first  result  toward  this  end, 
studies  the  relative  degree  of  a  passive  system.  Relative 
degree  of  a  system  is  the  number  of  times  the  output 
must  be  differentiated  for  the  input  to  appear  explicitly. 
The  following  definition  expresses  this  condition  using 
Lie  derivatives. 

Definition  3.1.  The  system  £2  is  said  to  have  a  rela¬ 
tive  degree  (rq,  7*2, . . . ,  rm)  at  a  point  (xq,  uq)  if: 

(i)  ^  [Cfhi(x)]  =  0  for  all  1  <  i  <  m,  x  in  the 
neighbourhood  o/xq  and  all  u  in  the  neighbourhood 
of  uo  and  all  k  <  ri, 

(n)  m  [q*Mx)]|(x0,u0)  ^ 

Note  the  relative  degree  of  a  nonlinear  system  is  a 
local  concept  defined  about  the  point  (xo,uo)  and  also 
depends  on  the  domain  of  control.  This  dependence  is 
a  result  of  the  non-affinity  of  the  system.  Next  a  lemma 
is  derived  that  will  help  determine  the  relative  degree 
of  £2. 

Lemma  3.1.  The  origin  belongs  to  the  set  Ui  given  in 
Definition  2.2. 

Proof.  Consider  the  open- loop  system  £2-  The  neces¬ 
sary  condition  for  passivity  with  positive  definite  storage 
function  is 


(3.17) 


£f,V(x)  <  0. 


This  indicates  that  the  system  is  stable  in  the  Lyapunov 
sense.  By  Lasalle’s  theorem  [17]  it  is  known  that 
the  state  of  this  open-loop  system  will  enter  the  set 
{x  G  Mn  :  £f0V(x)  =  0}.  This  is  exactly  the  set  Qi 
in  Definition  2.2.  This  result  also  can  be  shown  by 
Barbalat’s  lemma  [18]. 

Further,  the  set  Qi  contains  the  invariant  sets  of 
the  system.  Since  origin  is  the  fixed-point  of  the  system 
Ei,  it  is  concluded  that  it  belongs  to  the  set  Cli.  This 
completes  the  proof.  □ 


Since  |^(0)  =  g^  (0)RT R(0)  is  assumed  to  be  full 
rank,  i7go(0)  has  full  rank.  Hence  it  is  concluded  that 
|^go(0)  is  m  x  m  and  full  rank.  This  completes  the 
proof.  □ 

Remark  3.1.  For  an  affine  system ,  the  conditions  of 
Definition  2.2  are  satisfied  for  all  control  inputs.  Since 
the  relative  degree  for  an  affine  system  does  not  depend 
on  input ,  Theorem  3.2  consequently  reduces  to  Proposi¬ 
tion  2.44  [19]. 


The  next  theorem  analyzes  the  relative  degree  of 
the  passive  system  E2. 

Theorem  3.2.  Suppose  S2  is  passive  with  a  C 2  storage 
function  V  which  is  positive  definite.  If  go(0)  and  |^(0) 
have  full  rank ,  then  S2  has  relative  degree  (1, 1, . . . ,  1) 
at  (x  =  0,  u  =  0). 


The  next  result  examines  the  nature  of  the  zero 
dynamics  of  S2. 

Theorem  3.3.  Suppose  S2  is  passive  with  a  C 2  storage 
function  V  which  is  positive  definite.  7/go(0)  and  §^(0) 
have  full  rank ,  then  zero  dynamics  of  S2  locally  exist 
about  (x  =  0,  u  =  0)  and  is  weakly  minimum  phase. 


Proof.  The  relative  degree  of  S2  is  one  if 
non-singular,  or 


ay 

du 


(0,0)  is 


dy 

du 


(0,0) 


(3.18) 


go(x)  + 


d 

du 


dh 

dx 


m 

5>R<(x,  u) 

_i=  1 


dh 

^go(0) 

£goh(0) 


Proof.  From  Theorem  3.2,  S2  has  a  well-defined  relative 
degree  and  local  zero  dynamics  exist.  Let  the  set 
Q2  =  {x  G  Rn  :  h(x)  =  0}  define  the  points  on  the  zero- 

®ut  manifold.  By  definition  of  E2  this  set  contains 
origin.  By  Lemma  3.1  origin  is  also  contained  in 
the  set  Qi.  Thus,  in  order  to  study  the  local  nature 
of  the  zero  dynamics  about  the  origin,  only  those  state 
trajectories  that  fall  in  the  intersection  set  Q2  Pi  ^1  Reed 
to  be  considered.  On  these  set  of  points  properties  (i) 
through  (ii)  of  Definition  2.2  hold.  Hence, 


ismxm  and  non-singular.  Hence  conditions  for  which 
(3.18)  holds  true  need  to  be  determined.  This  is  carried 
out  in  the  following  two  steps. 

First,  since  E2  is  passive  it  satisfies  the  necessary 
conditions  given  in  Definition  2.2.  Further,  property 
(ii)  in  Definition  2.2  is  defined  only  for  set  f2]_.  From 
Lemma  3.1  it  is  known  that  origin  belongs  to  the  set  Di 
and  hence 


(3.19) 


d 

fa 


go(x) 


<9h 

fa 


go(x) 


is  satisfied  at  x  =  0.  Differentiating  and  using  the  fact 
that  |^(0)  =  0  in  (3.19) 

d2V  dh 

(3.20)  (0)^(0)go(0)  =  ^go(0). 

The  rest  of  the  proof  proceeds  similar  to  Proposi¬ 
tion  2.44  given  in  [19].  The  Hessian  §^r(0)  is  symmet¬ 
ric  positive  definite  by  properties  of  the  storage  function 
and  can  be  factored  as  RT R  with  some  matrix  R.  Then, 

dh 

goT(0)RTR(0)g0(0)  =  ^go(0). 


^  =  £f(x,u)R 

(3.22)  =4F  +  £goyu  +  uT£R(XiU)Fu 

=  ut£r(X)U)Fu. 

By  Definition  2.1,  for  passive  systems  V  <  yTu.  Fur¬ 
thermore,  this  condition  becomes  V  <  0  on  the  set 
^2  (P  ^i-  This  inference  along  with  condition  (3.22)  im¬ 
plies  that  the  origin  is  Lyapunov  stable  and  hence  zero 
dynamics  is  weakly  minimum  phase.  This  completes  the 
proof.  □ 

Theorems  3.2  and  3.3  together  give  the  necessary 
conditions  for  feedback  equivalence  to  a  passive  system. 
This  result  is  summarized  by  the  following  theorem. 

Theorem  3.4.  Suppose  go(0)  and  §^(0)  have  full 
rank.  The  necessary  conditions  for  transforming  E2  into 
a  passive  system  with  C 2  positive  definite  storage  func¬ 
tion  V  using  static  state-feedback  compensation  are: 

(i)  T12  has  relative  degree  {1,1,...,  1}  and 

(ii)  is  weakly  minimum  phase 


(3.21) 


Proof.  From  Theorem  3.2  and  Theorem  3.3  it  is  known 
that  the  resulting  system  will  have  relative  degree 
(1,1,...)  with  weakly  minimum  phase  zero  dynam¬ 
ics.  Further,  it  is  well  understood  that  relative  de¬ 
gree  and  zero  dynamics  are  invariant  under  static  feed¬ 
back  [20]  [Lemma  2.4].  Hence  the  conditions  in  the  proof 
follow.  □ 

Theorem  3.4  extends  the  powerful  feedback  equiva¬ 
lence  approach  to  general  nonlinear  systems.  It  provides 
necessary  conditions  for  a  system  to  be  made  passive  by 
feedback  under  mild  restrictions.  The  equivalent  theo¬ 
rem  for  affine  systems  derived  in  [15]  shows  that  Theo¬ 
rem  3.4  is  also  sufficient  for  feedback  passivity.  But  the 
topological  and  nonlinear  nature  of  nonafhne  systems 
hinders  this  result  to  be  sufficient. 


Figure  1:  Control  influence  of  the  continuously  stirred 
tank  reactor 


4  Application  to  Continuously  Stirred  Tank 
Reactor 

The  purpose  of  this  section  is  to  show  how  conditions 
given  by  Theorem  3.4  can  be  applied  to  test  whether  or 
not  a  system  can  be  rendered  passive  through  static- 
feedback.  The  nonafhne  system  under  consideration 
is  a  constant  volume  reactor  and  the  objective  is  to 
stabilize  the  system  about  its  equilibrium  by  adjusting 
the  coolant  how  rate.  The  system  [21]  is  represented  as 


the  origin  is  not  the  equilibrium  of  the  system  given 
in  (4.23).  The  equilibrium  solutions  are  obtained  by 
solving  the  following  transcendental  equations: 

(4.24a)  0  =  1  —  aq*  —  clqXi*  exp^-10  / x 2*^ 

(4.24b)  0  =  350  —  x 2*  +  cqaq*  exp^-10  . 

Rewrite  the  concentration  as  aq*  =  1/(1  + 

aoexp(-10  /^2*))  and  solve  for  roots  of 


(4.23a)  aq  =  1  —  x\  —  aoxi  exp^  10 

(4.23b)  x2  =  350  —  x2  +  eqaq  exp^-10  lx^ 

+  a3u(l  -  exp(-a2/n))(350  -  x2) 

and  where  0  <  aq  <  1  is  the  concentration  of  the  tank 

in  mo///,  x2  >  350  is  the  temperature  of  the  tank  in 

°K  and  u  >  0  is  the  coolant  how  rate  in  mol /min. 
The  system  parameters  [21]  are  given  in  Table  1.  The 
control  influence  in  (4.23)  is  nonlinear  in  the  control  and 
not  monotonic  in  any  variable.  This  trend  is  presented 
in  Fig.  1.  Owing  to  this  nonlinear  behaviour  previous 
studies  have  used  neural-network  based  control  designs 
to  stabilize  the  concentration  of  the  reactor  [21],  [22]. 

4.1  Test  for  Feedback  Passivation  The  first  step 
is  to  cast  the  system  into  form  of  £2.  However, 


Table  1:  Continuously  stirred  tank  reactor  model  pa¬ 
rameters 

Parameter  Value 

ao  7.2  x  10 wmin~1 
ai  1.44  x  1013 

a2  6.987  x  102 

&3  0.01 


(4.25) 

0  =  350  -  aq*  +  exp(-10  /^2*)[350a0  +  cq  -  a0aq*]. 

The  algebraic  equation  given  in  (4.25)  has  a  unique  root 
x2*  =  549.01257025° K.  Using  (4.24)  the  unique  root  for 
concentration  is  aq*  =  0.001128849277 mol /l.  Dehne 
the  states  e\  =  aq  —  aq*  and  e2  =  x2  —  aq*  to  shift 
the  equilibrium  to  origin.  Routine  calculation  gives  the 
following  system: 

(4.26a)  <q  =  c  —  e\  —  ao(aq*  +  ef)  exp^-10  /U2*+e2)) 
e2  =  d  -  e2  +  ai(ei  +  aq*)  exp^-10  /02*+e2)) 
(4.26b)  +  asu(l  —  exp^_a2/u^)(350  —  aq*  —  e2) 

(4.26c)  y  =  e2 

where  c  =  aoaq*  exp^-104/^2*)  and  d  = 
— cqaq*  exp(-10  /X2*^  and  a  dummy  output  has 
been  dehned.  Thus  comparing  (4.26)  with  £2  gives  the 
following  vector  held  dehnitions 

(4.27) 

f  /  x  c  —  e  1  -  a0(aq*  +  ef)  exp(_lo4/^2*+e2)) 

U  x  ;  d-  e2  +  ai(ei  +  aq*)  exp^_1°4/^2*+e2^ 


a3(350  -  aq*  —  e2)  J  ’ 


go(e)  = 


and 


(4.29)  h0(e)  =  e2. 

In  the  second  step  necessary  conditions  given  in 
Theorem  3.4  are  verified.  Notice  with  the  output 
defined  by  (4.29)  property  (i)  is  satisfied.  Furthermore, 
go(0)  =  [0, a3(350  —  X2*)]t  and  ^(0)  =  [0,1]  have 
full  rank.  Finally,  to  verify  that  the  (4.26)  is  weakly 
minimum  phase  set  the  output  y  =  e 2  =  0  and  its 
derivative  e<i  =  0.  This  leaves  the  following  internal 
dynamics 


ei  =  a0£i*  exp^  10  —ex  -  a0(x i*  +  ex)  exp 


(-101 2 3 4/x2* 


which  equivalently  becomes  a  linear  homogeneous  dif¬ 
ferential  equation 


e\  =  —ex  —  aoex  exp 


(-104/x2* 


From  Table  1  and  properties  of  exponential  function, 
it  can  be  concluded  that  the  internal  dynamics  given 
in  (4.31)  are  exponentially  stable  and  the  continuously 
stirred  tank  reactor  given  in  (4.23)  can  be  rendered 
passive  through  feedback. 


5  Conclusions 

In  this  paper  necessary  conditions  for  analytical  con¬ 
struction  of  control  for  unstable  nonaffine  systems  were 
derived.  This  work  extended  the  applicability  of  the 
well-established  feedback  passivation  control  law  design 
procedures  to  unstable  nonaffine  systems.  Furthermore, 
the  results  presented  do  not  require  the  control  influence 
to  be  non-singular  throughout  the  domain  of  interest. 
These  conditions  along  with  results  given  in  [8]  can  be 
employed  for  asymptotic  stabilization  of  a  general  non¬ 
affine  system  with  static  compensation  unlike  some  of 
the  switching  schemes  [23]  that  require  immense  off-line 
processing. 
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Flight  control  of  hypersonic  vehicles  is  challenging  because  of  the  wide  range  of  oper¬ 
ating  conditions  encountered  and  certain  aspects  unique  to  high  speed  flight.  A  particular 
safety  concern  in  hypersonic  flight  is  the  risk  of  an  inlet  unstart,  which  not  only  produces 
a  significant  decrease  in  thrust  but  also  results  in  a  change  to  the  aerodynamics  and  thus 
can  lead  to  the  loss  of  the  vehicle.  Previous  work  on  control  design  for  hypersonic  vehicles 
often  uses  linearized  or  simplified  nonlinear  dynamical  models  of  the  vehicle,  and  very  little 
work  has  been  done  on  recovering  from  unstart  events.  Using  a  generic  hypersonic  vehicle 
as  a  control  design  and  simulation  model,  this  paper  develops  a  nonlinear  adaptive  dynamic 
inversion  control  architecture  with  a  control  allocation  scheme  to  track  realistic  flight  path 
angle  trajectories.  A  robustness  analysis  is  performed  on  the  initial  control  architecture  de¬ 
sign,  which  shows  that  the  control  architecture  is  able  to  handle  time  delays,  perturbations 
in  stability  derivatives,  and  reduced  control  surface  effectiveness.  The  control  architecture 
then  is  evaluated  for  its  ability  to  handle  inlet  unstart.  Simulation  results  presented  in  the 
paper  demonstrate  that  the  approach  achieves  desired  tracking  performance  while  being 
robust  to  the  particular  uncertainties  and  inlet  unstart  conditions  studied. 

I.  Introduction 

The  design  of  control  architectures  for  hypersonic  vehicles  is  a  current  area  of  research  in  the  field  of 
controls.  One  particular  safety  concern  in  hypersonic  flight  is  the  risk  of  an  inlet  unstart,  which  can  lead  to  a 
decrease  in  thrust  and  the  possible  loss  of  the  vehicle.  There  are  three  main  reasons  that  cause  a  hypersonic 
airbreathing  engine  to  unstart:  (1)  a  flow  to  the  inlet  that  is  slower  than  the  required  Mach  number  for 
the  engine  to  operate,  (2)  an  altered  flow  that  no  longer  passes  through  the  throat  of  the  engine  because 
of  reasons  such  as  exceeding  the  limits  on  angle-of-attack  (a)  and  sideslip  angle  (/?),  and  (3)  an  increase 
in  the  back  pressure  in  the  engine  that  causes  the  shock  wave  to  move  ahead  of  the  throat  [1].  A  control 
architecture  for  a  hypersonic  vehicle  must  be  capable  of  maintaining  the  aircraft  on  a  controlled  trajectory 
in  the  event  of  an  inlet  unstart. 

Many  of  the  previous  control  designs  for  hypersonic  vehicles  have  involved  the  use  of  linearized  models 
of  the  aircraft  instead  of  the  full  nonlinear  equations  of  motion  [2],  [3],  [4],  [5].  Annaswamy,  et.al.  created 
adaptive  controllers  for  hypersonic  vehicles;  however,  the  controllers  are  designed  based  on  linearized  models 
of  the  aircraft  dynamics  and  require  gain-scheduling  for  their  implementation  [2], [3].  Groves,  et.al.  imple¬ 
mented  control  designs  based  on  linear  models  of  a  hypersonic  vehicle  for  setpoint  and  regulator  tracking  [4] . 
Bolender,  et.al.  designed  adaptive  control  laws  for  an  experimental  hypersonic  vehicle  based  on  a  linearized 
model  of  the  longitudinal  dynamics  of  the  vehicle  [5] . 

In  terms  of  work  with  nonlinear  models  for  control  design,  Johnson,  et.al.  applied  a  neural  network-based 
adaptive  control  architecture  to  a  model  of  the  X-33  vehicle  for  the  generation  of  ascent  and  abort  trajectories 
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as  well  as  the  control  of  the  aircraft  [6].  Fiorentini,  et.al.  [7]  and  Parker,  et.al.  [8]  both  used  simplified 
nonlinear  models  of  a  hypersonic  vehicle  in  their  control  design  that  exhibited  good  tracking  performance 
but  a  slow  response.  While  Parker,  et.al.  designed  an  approximate  feedback  linearization  controller,  the 
controller  in  that  paper  is  not  adaptive;  however,  a  case  study  of  their  approximate  feedback  linearization 
controller  showed  that  the  controller  was  robust  to  mild  plant  parameter  variations  in  the  moment  of  inertia 
I yy ,  the  vehicle  length,  and  the  mass  of  the  vehicle  [9]. 

This  paper  presents  a  design  of  a  nonlinear  adaptive  dynamic  inversion  controller  for  a  Generic  Hypersonic 
Vehicle  (GHV).  Because  the  dynamic  equations  for  the  GHV  are  inherently  nonlinear  and  the  aerodynamic 
and  control  derivatives  for  the  aircraft  have  significant  uncertainty  associated  with  them,  a  nonlinear  adaptive 
dynamic  inversion  control  architecture  was  selected  as  the  preferred  control  architecture.  The  design  of  the 
control  architecture  for  the  GHV  involved  the  nonlinear  equations  of  motion  for  the  vehicle.  When  the 
reference  trajectories  were  generated  for  the  simulation,  the  adaptive  dynamic  inversion  control  architecture 
was  altered  to  use  altitude  rate  h  instead  of  ct,  but  the  control  architecture  still  used  the  nonlinear  equation 
for  h  in  its  design. 

The  paper  is  organized  as  follows.  Section  II  provides  a  brief  overview  of  the  GHV  and  the  proposed 
control  architecture.  Section  III  contains  the  derivations  of  the  general  adaptive  dynamic  inversion  equations 
that  are  used  in  the  control  architecture.  Sections  IV  and  V  show  how  the  nonlinear  dynamic  equations 
for  the  GHV  are  transformed  into  the  form  given  in  Section  III  for  the  control  architecture.  Representative 
simulation  results  and  a  robustness  analysis  of  the  nonlinear  adaptive  dynamic  inversion  control  architecture 
are  given  in  Section  VI.  Section  VII  describes  the  development  of  equations  to  allow  realistic  trajectories  to 
be  generated  for  the  GHV  simulation,  and  the  simulation  results  for  these  realistic  trajectories  are  shown  in 
Section  VIII.  Finally,  the  conclusion  and  future  extensions  of  this  work  are  given  in  Section  IX. 

II.  Control  Structure  for  the  GHV 

The  Generic  Hypersonic  Vehicle  (GHV),  as  shown  in  Figure  1,  is  an  academic  hypersonic  aircraft  vehicle 
model  created  at  the  Air  Force  Research  Laboratory  as  a  platform  for  studying  control  laws.  The  GHV  plant 
simulation  is  implemented  using  a  Simulink  model  that  contains  the  nonlinear,  6-DOF  equations  of  motion 
for  an  inelastic  hypersonic  vehicle.  The  aerodynamic  and  thrust  forces  and  moments  acting  on  the  vehicle 
are  modeled  using  look-up  tables;  the  tables  for  the  aerodynamic  forces  and  moments  were  generated  based 
on  computational  fluid  dynamics  data  using  shock-expansion  methods  with  a  viscous  correction. 

Using  two  elevons  and  two  ruddervators,  it  is  desired  to  control  angle-of-attack  (ck),  sideslip  angle  (/?), 
and  aerodynamic  bank  angle  (p).  It  was  decided  to  command  the  aerodynamic  bank  angle  (p)  instead  of 
the  bank  attitude  angle  (0)  in  order  to  ensure  that  the  dynamic  inversion  is  singular  only  at  (3  =  ±90  deg. 
Figure  2  shows  a  diagram  of  the  GHV  system  with  the  adaptive  dynamic  inversion  controllers.  To  simplify 
the  process  of  designing  a  nonlinear  adaptive  dynamic  inversion  control  architecture,  it  is  assumed  that  the 
aircraft  states  can  be  divided  into  two  timescale  categories,  which  are  the  fast  states,  which  consist  of  the 
angular  rates  p,  g,  and  r  as  noted  in  [10],  and  the  slow  states,  which  consist  of  the  angles  a,  [3 ,  and  p.  An 
adaptive  dynamic  inversion  controller  first  must  translate  a,  /?,  and  p  commands  into  commands  for  the 
body  axis  rates  p,  g,  and  r,  which  then  are  passed  into  another  adaptive  dynamic  inversion  controller  that 
determines  the  corresponding  control  surface  deflections  to  achieve  the  desired  p,  g,  and  r  commands. 

The  following  three  sections  will  describe  the  equations  found  in  the  inversion  blocks  in  Figure  2.  See 
Reference  [11]  for  a  description  of  the  equations  that  are  contained  in  the  GHV  simulation.  For  the  equations 
derived  in  Sections  IV  and  V,  the  flat,  nonrotating  earth  assumption  [12,  p.  43]  is  made.  It  is  acceptable 
to  make  this  assumption  in  this  case  because  while  the  vehicle  is  flying  fast  enough  for  the  round  rotating 
Earth  effects  to  be  significant,  the  time  scale  of  the  controlled  dynamics  are  sufficiently  fast  to  neglect  them. 

III.  General  Adaptive  Dynamic  Inversion  Equations 

This  section  contains  the  derivation  of  the  control  laws  for  two  cases  of  the  adaptive  nonlinear  dynamic 
inversion  controller.  The  first  case  involves  dynamic  equations  containing  the  same  number  of  controls  and 
controlled  variables,  and  the  second  case  deals  with  dynamic  equations  with  a  greater  number  of  controls 
than  controlled  variables.  It  should  be  noted  in  both  cases,  the  general  nonlinear  equation  of  the  system  is 
assumed  to  be  affine  in  the  control,  which  is  reasonable  for  small  deflection  angles. 
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Figure  1.  The  Generic  Hypersonic  Vehicle  (GHV). 


Figure  2.  Diagram  of  the  nonlinear  adaptive  dynamic  inversion  control  architecture  for  the  GHV. 
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A.  Case  with  Equal  Number  of  Controls  and  Controlled  Variables 

Consider  a  general  nonlinear  equation  of  a  system  in  the  form 

x~f(x)+g(x)u  (1) 

where  x  G  Mn  is  the  state,  u  G  Mn  is  the  control,  and  f{pc)  :  W1  i— ►  W1  and  g{pc)  :  W1  i— ►  W1  are  locally 
Lipschitz  continuous.  It  is  assumed  that  g(x)  is  nonsingular  for  all  x  G  Mn.  Suppose  that  the  desired 
reference  dynamics  for  the  system  are  given  by 

Xm  =  Axm  +  Br  (2) 

where  xm  G  Mn  is  the  model  state,  r  G  Mn  is  a  bounded  reference  signal,  A  G  Mnxn  is  Hurwitz,  and 
B  G  Mnxn.  The  equation  for  the  error  between  the  reference  model  and  the  actual  system  is 


C  —  Xrn  X.  (3) 

Taking  the  time  derivative  of  equation  (3)  results  in 

e  =  xm-x  =  xrn-  f(x)  -  g{x)u.  (4) 

If  the  control  u  is  chosen  to  be 

u  =  [g{x)]~1[xm- f(x)+Ke-p]  (5) 

where  f(x)  :  Mn  i— >•  Mn  is  a  model  of  the  plant  dynamics,  K  G  Mnxn  such  that  K  =  KT  >  0  are  the  gains  on 
the  tracking  errors,  and  v  G  Mn  is  a  pseudo-control  signal,  then  substituting  equation  (5)  into  equation  (4) 
produces  the  error  dynamics 

e  =  —f(x)  +  f{x)  —  Ke  +  v.  (6) 

Defining  the  error  between  the  model  and  the  actual  system  as  A  =  f(x)  —  f(x),  equation  (6)  becomes 

e  =  —KeJr  A  +  z/.  (7) 

In  this  paper,  it  is  assumed  that  A  can  be  represented  in  the  form  A  =  WT /3(x;  d),  where  W  G  W>xn  is  a  set 
of  unknown  weights,  and  (3  Glpxl  is  a  set  of  known  basis  functions  composed  of  the  states  x  and_a  vector  d 
of  bounded  continuous  exogenous  inputs.  Using  this  representation  for  A,  v  is  chosen  to  be  v  =  — WT/3(x ;  d), 
where  W  G  Rpxn  is  an  estimate  of  the  weights.  With  these  definitions,  equation  (7)  can  be  written  as 

e  =  -Ke  -  WTf3(x ;  d)  (8) 

where  ID  =  W  —  ID  is  the  weight  estimation  error. 

The  stability  of  the  closed  loop  system  under  these  assumptions  can  be  established  using  a  candidate 
Lyapunov  function  of  the  form 

V  =  eTe  +  tr(WTYw~1W)  (9) 

where  T\y  GMpxp  with  Tvt/=Tv^T>0.  In  order  to  determine  the  adaptation  law  for  the  parameters  in  ID 

and  to  prove  that  the  error  between  the  states  of  the  actual  system  and  the  reference  model  will  converge, 

first,  the  derivative  of  equation  (9)  along  the  system  trajectories  is  taken,  which  gives  the  result 

_ 

V  =  2eTe  +  2tr(WTrw~1W  ).  (10) 

Substituting  equation  (8)  into  equation  (10)  produces 

_  _ 

V  = -2eTKe-2eTWT(3(x]d)  +  2tr(WTrw-1W  ).  (11) 

Applying  the  trace  identity  that  aTb  =  tr(baT ),  equation  (11)  is  determined  to  be 

_ 

V  =  -2 eTKe  +  2tr(WT (Tw^W  -  f3(x;  d)eT)).  (12) 
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Then,  by  choosing  W  as 


W  =  TwProj(W,  0(x ;  d)eT)  (13) 

where  Proj  represents  the  projection  operator,  which  is  used  to  maintain  specified  bounds  on  the  weights 
[13],  V  can  be  upper  bounded  as 

V  <  -2 eTKe  <  0  (14) 

which  implies  that  e  is  bounded.  Because  r  is  bounded  by  definition  above,  xm  is  bounded.  Since  e  and  xm 
are  bounded,  x  is  bounded.  Consequently,  /3(x;  d )  is  bounded  as  well.  In  order  to  use  Barbalat’s  lemma  [14] 
to  complete  the  proof,  the  second  derivative  of  equation  (9)  along  the  system  trajectories  is  taken,  which 
gives  the  result 

V  =  -4  eTKe.  (15) 

Substituting  equation  (8)  into  equation  (15)  produces 

V  =  -4 eTK(-Ke  -  WT f3(x;  d)).  (16) 

Because  e,  kb,  and  /3(x;d)  are  bounded  as  proved  above,  V  is  bounded,  and  therefore  V  is  uniformly 
continuous. 

Because  V  is  lower  bounded,  V  is  negative  semi- definite,  and  V  is  uniformly  continuous,  by  Barbalat’s 
lemma  V  —>  0  as  t  — >  oo,  and  thus  e  — >  0  as  t  — >  oo  as  desired. 

B.  Case  with  a  Greater  Number  of  Controls  Than  Controlled  Variables 

Specifically  for  the  GHV,  the  form  of  the  general  adaptive  dynamic  inversion  controller  in  the  previous 
subsection  applies  to  the  a,  /?,  and  fi  inversion  component,  in  which  the  number  of  inputs  to  the  system  (a, 
/?,  fi)  is  equal  to  the  number  of  outputs  (p,  g,  r).  However,  in  the  p,  q,  r  inversion  component,  the  number  of 
inputs  to  the  system  (p,  g,  r)  is  not  the  same  as  the  number  of  outputs  (£pr,  5/,/,  5t,r>  £/,/)•  The  fact  that  the 
number  of  outputs  is  greater  than  the  number  of  inputs  requires  a  control  allocation  scheme  to  be  integrated 
into  the  inversion  control  law.  To  frame  the  problem  in  general  terms,  consider  the  given  nonlinear  equation 
of  a  system  in  the  form 

X  =  f(x)  +  g(x)Au  (17) 

where  x  G  Mn  is  the  state,  u  G  Mm  is  the  control,  f(x)  :  Mn  i— »  Mn  and  g(x)  :  Mn  Mnxm  are  locally 
Lipschitz  continuous,  and  A  G  Mmxm  is  a  constant  unknown  positive  definite  matrix.  It  is  assumed  that 
g(x)  is  full  rank  for  all  x  G  Mn.  Suppose  that  the  desired  dynamics  of  the  closed  loop  system  are  given  by 

Xm,  —  T  Bv  (18) 

where  xm  G  Mn  is  the  model  state,  r  G  Mm  is  the  bounded  reference  signal,  A  G  Rnxn  is  Hurwitz,  and 
B  G  Mnxm. 


The  derivation  of  the  control  law  and  the  adaptive  laws,  including  one  for  the  unknown  control  effective¬ 
ness  matrix  A,  proceeds  similarly  to  the  derivation  in  Subsection  A.  The  equation  for  the  error  between  the 
reference  model  and  the  actual  system  is 


p>  -  rp  _  rp 

(19) 

Taking  the  time  derivative  of  equation  (19)  results  in 

e  =  xm-x  =  Xm-  f(x)  -  g(x)Au. 

(20) 

The  desired  final  form  for  e  is 

e  =  —Ke  —  WT (3(x;  d)  +  g(x)Au 

(21) 

which  is  the  same  as  the  final  form  for  e  in  Subsection  A,  except  for  the  final  term  g(x)Au.  With  the 
appropriate  choice  of  adaptive  law  for  A,  the  choice  of  the  above  final  form  for  e  will  allow  the  stability 
of  the  system  to  be  proven.  In  order  to  derive  this  desired  form  of  e,  first  the  term  g(x)Au  is  added  and 
subtracted  from  equation  (20),  where  A  G  Rmxm  is  an  estimate  of  the  control  effectiveness  matrix,  and  the 
error  equation  becomes 

e  =  Xm  —  f(x)  —  g{x)Au  +  g{x)Au  —  g{pc)Au.  (22) 
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Let  A  =  A  —  A,  which  is  the  estimation  error  of  the  control  effectiveness  matrix.  Then,  equation  (22) 
simplifies  to 

e  =  Xm  ~  f(x)  -  g{x)Au  +  g{x)Au.  (23) 

Because  of  the  fact  that  the  number  of  controls  is  greater  than  the  number  of  controlled  variables  in 
this  case,  there  sometimes  are  infinite  choices  for  u  at  any  instant  in  time.  In  order  to  determine  a  specific 
control  law  for  the  system,  a  constrained  optimization  problem  is  solved  in  which  the  cost  function  J  =  uTQu , 
where  Q  G  Mmxm  with  Q  =  QT  >  0,  will  be  minimized,  subject  to  the  constraint  g(x)Au  =  £,  which  must 
be  satisfied  at  all  times.  The  cost  function  is  chosen  to  be  J  =  uTQu  so  that  the  control  effort  will  be 
minimized,  which  consequently  can  be  used  to  reduce  the  amount  of  trim  drag  during  flight,  it  is  assumed 
by  this  formulation  of  the  problem  that  the  control  surfaces  do  not  have  position  limits,  and  as  a  result, 
sufficient  control  power  will  always  exist.  By  choosing  the  term  £  in  the  constraint  equation  to  be 

£  =  Xm  ~  f(x)  +  Ke  —  v  (24) 

where  f(x)  :  Mn  i— ►  Mn  is  an  estimate  of  the  plant  dynamics,  K  G  Mnxn  with  K  =  KT  >  0  contains  the 
gains  on  the  errors,  and  v  G  Mn  is  a  pseudo-control  signal,  the  constraint  g(x)Au  =  £  will  ensure  that  when 
the  derived  control  law  for  this  second  case  is  substituted  into  the  equation  for  e,  and  the  equation  for  the 
error  dynamics  is  simplified,  the  first  two  terms  of  equation  (21)  will  appear  in  the  resulting  equation  for  e  as 
desired.  For  simplicity  in  the  control  law  derivation,  equation  (24)  will  not  be  substituted  into  the  constraint 
equation  at  the  present  time. 

To  derive  the  control  law,  first  the  constraint  must  be  included  in  the  cost  function  to  form  the  augmented 
cost  function 

J  =  utQu  +  XT  (g(x)Au  —  £)  (25) 

where  A  G  Mn  is  a  Lagrange  multiplier.  The  necessary  conditions  for  minimizing  J  are  given  by 

BJ  - 

—  =  g(x)Au  -£  =  0 

=  2  Qu  +  ATgT(x)X  =  0. 
ou 

Rearranging  terms  in  equation  (27)  results  in 

u  =  -^Q~1ATgT(x)X. 

Substituting  equation  (28)  into  equation  (26)  and  solving  for  A  produces  the  equation 

A  =  —  2(g(x)AQ~1ATgT(x))~1£. 

Finally,  substituting  equation  (29)  back  into  equation  (28)  results  in  the  control  law 

u  =  Q~1ATgT(x)(g(x)AQ~1ATgT(x))~1t. 

In  order  for  the  control  law  given  in  equation  (30)  to  be  continuous,  Q  and  g(x)AQ~1ATgT(x)  must  be 
invertible.  The  projection  bounds  that  will  be  applied  in  the  adaptive  law  for  A  must  ensure  that  A  remains 
invertible.  It  should  be  noted  that  for  the  case  where  the  number  of  controls  equals  the  number  of  controlled 
variables,  the  control  solution  is  unique,  and  the  control  law  in  equation  (30)  simplifies  to 

u=[g(x)]~1[xm- f(x)  +  Ke-u]  (31) 

which  is  the  control  law  that  was  chosen  in  Subsection  A. 

Continuing  with  the  derivation  of  e,  let  A  =  f(x)  —  f(x).  Substituting  equation  (30),  equation  (24),  and 
into  equation  (23)  produces  the  equation 

e  =  —  Ke  +  A  +  v  +  g{pc)Au.  (32) 

Again,  assume  that  A  can  be  represented  in  the  form  A  =  WT (3(x;d),  where  W  G  Wxn  is  a  set  of 
unknown  weights,  and  [3  G  Mpxl  is  a  set  of  known  basis  functions  composed  of  the  states  x  and  a  vector  d 
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of  bounded  continuous  exogenous  inputs.  Also,  the  representation  for  v  is  chosen  to  be  v  =  —WT/3(x;d), 
where  W  G  Mpxn.  Then,  equation  (32)  can  be  written  as 

e  =  —  Ke  —  WT  /3(x;  d )  +  g(x)Au  (33) 

where  W  =  W  —  W,  which  is  the  weight  estimation  error. 

As  in  Subsection  A,  a  Lyapunov  analysis  needs  to  be  performed  in  order  to  determine  the  adaptive  laws 
for  A  and  W  and  to  prove  that  the  error  between  the  states  of  the  actual  system  and  the  reference  model 


will  converge.  Given  the  candidate  Lyapunov  function 

V  =  eTe  +  tr(WTVw^W)  +  £r(ATA^A)  (34) 

where  Yw  G  W>xp  with  Yw  =  r wT  >  0,  and  TA  G  Mmxm  with  TA  =  Tat  >  0,  the  derivative  of  equation 
(34)  along  the  system  trajectories  is  taken,  which  results  in  the  equation 

V  =  2eTe  +  2tr(WTTw~1W  )  +  2tr(ArA~1A  ).  (35) 

Substituting  equation  (33)  into  equation  (35)  produces 

_  _  _ 

V  =  -2 eTKe  -  2eTWT(3(x ;  d)  +  2eT g(x)A  +  2tr(WTTw~1W  )  +  2tr(ArA_1A  )  (36) 

and  by  applying  the  trace  identity  that  aTb  =  tr(baT)  to  equation  (36),  the  equation  for  V  becomes 

V  =  -2eTKe  +  2tr(WT{Tw-1W  -  0(x;  d)eT))  +  2tr(A(TK~1A  +ueTg(x))).  (37) 

Let  the  equation  for  W  in  this  case  be  the  same  as  equation  (13),  and  let  A  be 

A  =  TAProj(A,-ueTg{x)).  (38) 

In  this  case,  the  final  equation  for  V  is  is  upper  bounded  by 

V  <  -2 eTKe  <  0  (39) 


which  implies  that  e  is  bounded.  Since  r  is  bounded  by  definition  above  and  e  is  bounded,  xm  is  bounded, 
and  thus  x  is  bounded.  Consequently,  g{pc)  and  (3(x;d )  are  bounded  as  well.  In  order  to  use  Barbalat’s 
lemma  to  complete  the  proof,  the  second  derivative  of  equation  (34)  along  the  system  trajectories  is  taken, 
which  gives  the  result 

V  =  -4  eTKe.  (40) 

Substituting  equation  (33)  into  equation  (40)  produces 

V  =  -4 eTK(-Ke  -  WT (3(x;  d)  +  g{x)Au).  (41) 

It  should  be  noted  that  u  is  bounded  becausejdl  of  the  the  signals  found  in  u,  which  is  given  by  equations 
(30)  and  (24)  are  bounded.  Thus,  because  e,  W,  j3(x;d),  g(x),  A,  and  u  are  bounded  as  proved  above,  V  is 
bounded,  and  therefore  V  is  uniformly  continuous. 

Finally,  Barbalat’s  lemma  can  be  applied.  Because  V  is  lower  bounded,  V  is  negative  semi-definite,  and 
V  is  uniformly  continuous,  by  Barbalat’s  lemma  V  —>  0  as  t  — ►  oo,  and  thus  e  0  as  t  —>  oo  as  desired. 

IV.  P,  Q,  R  Inversion  Controller 

The  first  designed  controller  was  the  inversion  controller  for  the  angular  body  rates  of  the  GHV  since 
these  variables  are  linked  directly  to  the  control  surface  deflections,  which  control  the  vehicle.  The  reference 
inputs  to  the  controller  are  the  commanded  angular  body  rates  pc,  qc  and  rc,  and  the  output  states  of  the 
controller  are  the  control  surface  deflections  £/?r,  and  .  Therefore,  in  this  case,  equation  (17) 

represents  the  current  system.  In  order  for  the  adaptive  dynamic  inversion  controller  to  be  designed  for  the 
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angular  body  rates,  f(x)  and  g(x)  must  be  determined  from  the  general  nonlinear  equations  for  p,  g,  and  r, 
which  in  vector-matrix  form,  are 


where 


and 


[J] 


diOBJ 

dt 


+  WBJ  x  Jwb,I  =  Maero  +  Mt 

B 


[J] 


Jx  0  Jxz 

0  Jy  0 

Jxz  0  J z 


mb,  i 


p 

Q 

r 


Substituting  these  equations  into  equation  (42)  and  simplifying  produces  the  result 


(42) 


(43) 


(44) 


Jx  0 

0  Jy 


p 

i 

1 

iN* 

+ 

1 

_ i 

Q 

+ 

(Jx  -  Jz)pr  +  Jxz (p2  ~  r2) 

r 

JxzQT  "1“  (Jy  Jx)PQ 

=  Maero  +  Mr. 


Therefore,  the  nonlinear  equations  for  the  angular  body  accelerations  can  be  written  as 


p 

Jx 

0 

Jxz 

-1 

/ 

-Jxzpq  +  (Jz  -  Jy)qr 

Q 

= 

0 

Jy 

0 

- 

(Jx  -  Jz)pr  +  Jxz(p2  -  r2) 

+  Maero  +  Mt 

r 

_  Jxz 

0 

Jz  _ 

V 

1 

4- 

-i 

+ 

1 

£ 

i _ 

J 

(45) 


(46) 


After  having  determined  the  nonlinear  equations  for  the  angular  body  accelerations,  the  next  step  is  to 
write  those  equations  in  the  form  of  equation  (17).  In  order  to  accomplish  this  task,  the  terms  related  to  the 
control  surfaces,  which  will  form  g(x),  must  be  extracted  from  equation  (46).  The  control  surfaces  terms  are 
included  in  the  aerodynamic  moment  terms  M^,  which  are  modeled  as 


where 


and 


'LA' 

’  qSbCt' 

Maero  — 

ma 

= 

qScCm 

Na 

qSbCn 

^z,oaseune  i  z,surj  aces  i  r 

z 

Q 

Cm  =  Cm, baseline  T  ^Cm,sur faces  T  OT7 

Z  Vt 


[yniqQ.  ' 


(Cn 


Cn  —  Cnfiaseline  T  ^CUi)Sur faces  T  OT7 

z  Vt 

±Ci,Su 
-  ACi:st  J  ( 


^Ci^sur faCes  —  ^Ci^f  r  (M",  Cf,  /3,  ^/,r)  T  {Cl 

+  ACi^stjr(M,  <a,  /?,  5t,r)  +  x  C 


(47) 


(48) 


(49) 


for  i  =  t,  m,  n. 

As  seen  in  equation  (48),  the  moment  coefficients  are  comprised  of  three  parts.  The  baseline  term  is 
the  moment  coefficient  for  the  base  airframe,  while  the  second  and  third  terms  adjust  for  the  effects  on 
the  moment  coefficients  due  to  the  control  surfaces  and  damping,  respectively.  In  equation  (48),  the  first 
and  third  terms  do  not  depend  on  the  control  surfaces;  therefore,  those  two  terms  belong  to  the  f{x)  term 
in  equation  (17).  In  order  to  determine  g(x),  the  second  term  in  each  equation  in  equation  (48)  must  be 
examined  to  determine  what  portion  of  the  term  is  control-dependent  and  thus  belongs  in  g(x).  For  this 
particular  control  design  for  the  GHV,  it  is  assumed  that  a  linear  approximation  with  respect  to  the  control 
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surface  deflection  5  can  be  made  for  each  of  the  terms  in  equation  (49).  The  linear  approximation  can  be 
expressed  as 


A Cit6a  (M,  a,  f3,  S3)  =  Ci,8s  (M,  a,  / 3 ,  [<5S 

for  i  =  N,Y,  A,  £,  m,  n 
and  Ss  ^t,i' 


0]) 


da 


ids 


dS , 


M,a,/3  constant 


(50) 


In  this  paper,  it  is  assumed  that  all  interactions  between  each  control  surface  are  negligible,  which  at 
high  Mach  numbers  is  approximately  true.  Deflections  of  the  right  and  left  control  surfaces  will  generate 
summative  forces  and  moments  in  the  XZ-plane  of  symmetry  of  the  aircraft,  whereas  in  the  other  planes, 
the  deflections  will  generate  canceling  forces  and  moments.  In  equation  form,  for  both  the  flaps  and  the  tail 
control  surfaces,  the  relationships  between  right  and  left  elevon  deflections  are  expressed  as 


CW,(S/>r  =  CN,6ffi  — CV,<5/,r  =  CY,Sf,i 

CU,<S/>r  =  CA,6ffi  “ Cqd/,r  =  C£,6fti  (51) 

Cm,Sf^r  =  Cn,5 /,r  =  Cn,S  f, 

and  the  relationships  between  right  and  left  rudder  deflections  are  expressed  similarly  as 

Cnm,v  =  —Cym,t  =  ^Y,st:l 

=  cA,8t,i  ~c^8t,r  =  a, St,  (52) 

Cm,5t,r  =  ^m,St,i  —Cn,5t,r  =  Cn,5t,- 

Consequently,  in  equation  (50),  the  term  where  Ss  =  0  can  be  written  for  the  combined  effect  of  both  the 
right  and  left  control  surfaces  collectively  as 


a,sf(M,a,/3,  [5f,r 

II 

cT 

II 

o 

II 

I  2 Ci,SfJM,a,l3,  [Sf^r  =  0]) 

for  i  =  N,  A,m 

(53) 

1° 

for  i  =  Y,£,n 

a,st(M,a,/3,  [5t,r 

=  0, 5t,i  =  0])  =  ■ 

J4 Ci,Stir(M,a,p,  [Vr  =  0]) 

\ 

for  i  =  N,A,m 

(54) 

0 

for  i  =  Y,£,  n. 

Given  equations  (50),  (53),  and  (54),  equation  (49)  can  be  rewritten  for  i  =  m,n  as 


AC,= 


dCe6f 


dSf^r 

dCii6ti, 


A  *  ,  9Ce,sf 

f,r  +  - 

M,a,/3  constant 


AS 


dS, 


t,r 


A  S; 


M,a,/3  constant 


t,r 


dCitstA 


dS, 


M,a,/3  constant 

A  6t,i 

M,a,(3  constant 


A  Cm  =  2  Cm,SfjM,a,/3,  [Sf,r  =  0])  +  2Cm,stjM,  a,  (3,  [St,r  =  0]) 


dc. 


m,Sf 


Odf-r 

dCmdt,, 


AS 


dC . 


M,a,/3  constant 


f,r 


m,8 


A  Cn  = 


+ 


d5t)r 

dCn,sf, 


AS; 


t,r 


dSf)i 
dCm^8t  i 


AS 


M,a,/3  constant 


M,a,/3  constant 


dS 

dCn,8t,r 


dS; 


t,r 


dSu 

AX  i 

^f,r  +  ;}X 

M,a,(3  constant 

dCnrSt, 


A  & 


]t,l 


M,a,f3  constant 


A*,, 


A  5t,r  + 


M,a,/3  constant 


dS ; 


t,l 


constant 

A  6t,i. 

M,a,/3  constant 


(55) 


(56) 


(57) 


Since  the  first  two  terms  of  equation  (56)  are  for  fixed  values  of  Ss,  they  constitute  bias  terms  and  therefore 

dCifSe 

belong  in  the  f(x)  portion  of  equation  (17).  As  a  result,  only  the  terms  represented  by 


% 


in  equations  (55),  (56),  and  (57)  belong  in  the  g(x)  term  in  equation  (17). 


M,a,/3  constant 
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To  complete  the  analysis  of  the  terms  in  equation  (46),  the  effect  of  the  center  of  gravity  shift  must  be 
accounted  for  in  the  nonlinear  equations  for  the  angular  body  accelerations.  The  shift  of  a  set  of  moments 
from  a  given  reference  point  to  the  center  of  gravity  is  given  by  the  equation 


FI  eg  Maero  T eg / aero  ^  'aero 

i  T 


and  in  this  particular  simulation,  rcg/aero  is  defined  to  be 
similarly  to  Maero  in  equation  (47)  above,  which  means  that  F( 


Xcg  0  0 


.  In  the  simulation,  Fc 
aero  has  the  form 


(58) 

aero  is  calculated 


'XA' 

~-qSCA' 

F  — 

1  aero  — 

ya 

= 

qSCY 

Za 

- qSCN 

(59) 


Therefore,  given  equation  (59)  and  the  definition  of  rcg/aero ,  equation  (58)  can  be  written  as 


xcg 

~qSC. a 

FI Cg  —  Maero 

0 

X 

qSCY 

_  0  _ 

—qSCN 

FI Cg  —  FI aero 


0 


(60) 


(61) 


-qSCNxcg  | 

-qSCYxcgj 

where  Maero  is  defined  in  equation  (47).  It  should  be  noted  that  the  terms  Cn  and  Cy  in  equation  (61)  can 
be  written  like  the  moment  coefficients  in  equations  (55),  (56),  and  (57)  as 


ACn  =  2CW,d/;r(M,  ct,  /?,  [5f:r  =  0])  +  2Cjv,<jtjr(M,  a,  (3,  [5t,r  =  0]) 


+ 


+ 


dCNt6f 


85  f^r 
8CN,5t. 


A  *  ,  dCNjftl 

+  ft  c 

M,a,/3  constant  uuf^ 


M,a,/3  constant 


(62) 


A  Cy  = 


85  fr 

8Cy,sfj1 


A  5t,r  + 


8C ■ 


N,8t; 


M,a,{3  constant 


+  ' 


85  f,r 

8Cy  st 


85til 

dCy^sf  t 


AS, 


]t,i 


M,a,/3  constant 


't,r 


ASf,r  +  Ofs 

M,a,(3  constant  uuf^ 


A  St 


M,a,/3  constant 


05, 


A  sf,i 

M,a,/3  constant 

A<5t>; 

M,a,/3  constant 


(63) 


Similarly  to  the  moment  coefficients  as  shown  above,  since  the  first  two  terms  of  equation  (62)  are  for  fixed 
values  of  5S,  they  constitute  bias  terms  and  therefore  belong  in  the  f{pc)  portion  of  equation  (17).  As  a  result, 

dCi,5s 

in  equations  (62)  and  (63)  belong  in  the  g(x)  term  in 

M,a,/3  constant 


only  the  terms  represented  by 


85s 


equation  (17). 

Having  examined  all  of  the  terms  in  the  nonlinear  equations  for  the  angular  body  accelerations,  equation 
(46)  can  be  written  in  the  final  form  of  equation  (17)  as 


P 

J. X 

0 

Jxz 

-1 

/ 

1 

+ 

n 

■4? 

i 

5f,r 

A  r  i 

\ 

Q 

— 

0 

Jy 

0 

- 

(Jx  -  Jz)pr  +  Jxzip 2  -  r2) 

+  Mt  +  qSG  +  qSH 

0fl 

X 

r 

_  J xz 

0 

Jz  _ 

V 

Jxzqr  ( Jy  Jx)pq 

Of,r 

_5t,i_ 

/ 

(64) 
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where 


G  = 


baseline  T  2Vr 


Gui, baseline  T  ( GmqQ  T  Gm6iQi)  H~  2Crn^Sf  r{^f,r  —  0)  T  ‘2Grn^t  r{8t,r  —  0) 

Z  Vt 


2jG 'n,5 /,r(^  f->r  —  ®)%cg  2C N  ;£t r{$t,r  —  0)^c<7 

b( Cr , baseline  T  2Vt  (^nr^ ) 


and 


H  = 


;  dC, 

dd  f  r 


95f,i 


r  9CN 
L°9  95 


-9Cjj2 
' 95f,r 
\9Cn  _  _  9CY 
95f,r  Xc9d5f,r 


-  9Cj2 

" 95  f 


T  9Cn  \ 
cg  95 f, i  I 


( h9Cn_  _  T  9Cy_\ 
U  95 fj  Xc9  95f  tl) 


b_9C£_ 

h9C^ 

°95t,r 

°95ttl 

'r 9Cm, 

—  9Ccg 

9Cn\ 

(r9Cm 

'  _  9CCg 

9Cn 

C95t,r 

95t,r  ) 

\C 95 ttl 

95  ig 

\  9Cn 

9Cy  \ 

(b^L 

9Cy 

095t,r 

~  %cg 

95t,r  ) 

\  95t,i 

—  %cg 

95  tg 

(65) 


(66) 


It  should  be  noted  that  the  partial  derivatives  in  equation  (64)  are  taken  with  respect  to  a  constant  value 
of  M,  cq  and  f3  from  the  current  flight  condition  and  that  the  control  surface  bias  terms,  where  Ss  =  0,  are 
evaluated  at  a  constant  value  of  M,  a  and  /3  from  the  current  flight  condition  as  well. 

Given  equation  (64),  which  is  now  in  the  form  of  equation  (17),  the  adaptive  dynamic  inversion  controller 
can  be  constructed  using  equations  (13),  (24),  (30),  and  (38). 


V.  ay  /?,  fi  Inversion  Controller 


As  with  the  p,  q,  r  inversion  controller,  equations  for  d,  /?,  and  p  must  be  determined  in  order  for  the 
adaptive  dynamic  inversion  controller  to  be  constructed.  It  should  be  noted  that  for  this  section,  Sx  will 
represent  sin  (a;),  Cx  will  represent  cos(x),  and  Tx  will  represent  tan(x),  where  x  is  an  angle.  The  derivations 
for  a  and  /?  are  based  on  the  derivations  for  those  terms  on  pages  110-112  in  Reference  [15].  The  starting 
point  of  the  derivations  is  the  basic  force  equations  in  the  stability  axes  under  the  flat  Earth  assumption, 
which  are 

bVrel  =  (1  /m)FA,T  +  9~  Ub/e  X  Vrei.  (67) 

Taking  the  time  derivative  of  the  relative  velocity  in  the  wind  axes  instead  of  in  the  body  axes  and  converting 
the  right  hand  side  of  equation  (67)  to  the  wind  axes  produces  the  result 

mVr  =  FtCol-\-CXtC —  D  —  mgS ^  (68) 


m{WT  =  -FTCa+aTS(3  -  C  +  mg(CaSpSe  +  CpSpCg  -  S^SpCpCe)  -  mVT(pSa  -  rCa )  (69) 

maVrCp  =  —  FTSa+aT  —  L  +  mg(SaSo  +  C^CpCg)  +  mVr(—pSpCa  +  qCp  —  rSpSa )  (70) 

where  D,  L,  and  C  represent  drag,  lift,  and  cross-wind  force,  respectively,  in  the  wind  axes. 

In  order  to  simplify  equations  (69)  and  (70)  and  to  express  them  in  terms  of  /i,  which  is  one  of  the 
commanded  states,  the  gravity  terms  in  those  equations  are  transformed  using  relationships  given  by  the 
following  direction  cosine  matrices  from  Chapter  4  of  Reference  [12]  as 

Tw,h{p,  7>x)  =  —oi,P)TBiH((l>,  0,  i>)  (71) 


where  W  represents  the  wind  axes,  B  represents  the  body  axes,  and  H  represents  the  local  horizon  axes. 
Each  direction  cosine  matrix  has  the  general  form 


T2ti(9x,0y,9z) 


CoyCoz 

SexSeyCez-CexSez 

CexSeyCezFSexSez 


CeySez 

SexSeySezFCexCez 

CexSeySez-SexCez 


~Sev 

Se.Ce, 

CexC6v 


(72) 


as  shown  on  page  9  of  Reference  [16].  By  examining  the  elements  of  the  matrices  in  equation  (71),  the 
following  relationships  involving  p  and  7  were  determined  to  be 


Tw,h(  2, 3)  =  SjJjC1  =  CaSpSo  +  CgS^Co  —  S^SpCpCg 


(73) 
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(74) 


Tw,h(  3, 3)  =  C^C, y  =  SaSe  +  C^CpCe 

which  can  be  substituted  into  equations  (69)  and  (70)  in  the  gravity  terms. 

Additionally,  the  thrust  force  Ft  terms  are  converted  into  the  wind  frame  and  expressed  in  terms  of  the 


vector 


1  T 


Ft  Ft  Ft 


,  which  is  given  in  the  body  frame.  The  transformation  of  the  Ft  terms  results  in 


FtF'o.-Tcx.t  Fp 

'  cac0  s0  sac0 ' 

Ft 

-L  X 

FtCu+olt  F P 

= 

-cas0  c0  -sas0 

Ft 

1  y 

FtScx+cxt 

1 

o 

to 

1 

Ftz 

FTxCaCp  +  FTySp  +  FTzSaCp 
-Ftx CaSp  +  FTy Cp  -  FTz SaSp 
—FTxSa  +  FTzCa 


Finally,  the  forces  D,  C,  and  L  must  be  written  in  terms  of  the  corresponding  forces  in  the  stability  axes, 
which  can  be  calculated  directly  from  information  in  the  model,  as 


~D 

7 

tq 

o 

_ 1 

D~ 

C 

= 

o 

7 

to1 

1 

Ys 

L 

0  0  lj 

Ls 

DsCp  +  YsSp 
—DsSp  +  YsCp 

Ls 


It  is  assumed  that  the  Ds  terms  are  absorbed  into  the  thrust  terms  in  equation  (75). 

Substituting  equations  (73),  (74),  (75),  and  (76)  into  equations  (69)  and  (70)  gives  the  final  equations 
for  ft  and  a  to  be 

P  =  -h-  ((V  +  FTy)C0  +  mgS^C-f  -  FTxCaS0  -  FTzSaS0)  +  (pSa  -  rCa )  (77) 

TTl  Vt 


a  =  *  (-L,  +  mgC ^  -  FTx Sa  +  FTz Ca)  +  ( -pCaT0  +  q  -  rSaT0).  (78) 

TTl  Vt^/3 

Now,  the  equation  for  ft  can  be  derived  since  the  derivation  involves  the  results  given  in  equations  (77) 
and  (78).  Starting  from  equation  (57)  on  page  56  of  Reference  [12],  where,  for  this  document  ft  =  —a 
in  Reference  [12],  the  relationship  between  the  angular  body  accelerations  and  the  local  horizon  angular 
accelerations  are  expressed  as 


p-  0Sa 

cac0  -cas0  -sa 

"l  0  -,S’7  ’ 

ft 

q  —  a 

- 

S( 3  C0  0 

0  c „  s ^ 

7 

r  +  /3Ca_ 

sac0  . sas0  ca 

0  —S^  CijCj_ 

_x_ 

Taking  the  inverse  of  equation  (79),  the  equation  for  ft  is  determined  to  be 

ft  =  (p  -  ftSoft  ( CaCp  -  T7CaSpS^  -  T7SaC^)  +  {q-a)  (Sp  +  T^CpS^) 

+  (r  +  ftCoft  (SaCp  +  T^CaC^  -  T7SaSpSJ .  '  ^ 

Substituting  equations  (77)  and  (78)  into  equation  (80)  and  simplifying  gives  the  final  equation  for  /i,  which 
is 


V  = 


mVr 


Ls(Tp  +  T7SM)  +  (Ys  f  FTy)T7C^Cp  -  mgC1CllTp  +  (. FTxSa  -  F^C^S^  +  Tp) 
-  (Ftx C a  +  FTzSa)TC^sft\  +  pCa  sec (/?)  +  rSa  sec (/?). 


(81) 
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Finally,  equations  (77),  (78),  and  (81)  are  combined  together  in  vector- matrix  equation  form  as 


((Ys  +  FTy)Cg  +  mgS^C7  -  FTxCaSg  -  FTzSaSg ) 

mVrCp  ~  C TxSa  +  FTzCa) 

(ls(Tp  +  T^Sfj)  +  (Y,  +  FTy)T^C^C0  -  mgCyC^Tf} 

+  (FTx Sa  -  FTzCa)(T7S „  +  Tg)  -  (FTx C a  +  F^SJT^Sgj 

Sa  0  ~Ca  p 

+  -TpCa  1  -TpSa  q 

sec(/3)Ca  0  sec  (/3)Sa  r 


(82) 


where  p,  q,  and  r  are  the  desired  angular  body  rates.  It  should  be  noted  that  it  is  assumed  that  the  forces  due 
to  control  surface  deflections  are  negligible,  and  therefore,  the  force  terms  in  equation  (82)  are  approximated 
from  look-up  tables  for  the  force  and  moment  coefficients  at  points  where  the  control  surface  deflections  are 
equal  to  0.  Also,  it  is  assumed  that  for  the  desired  angular  body  rates  that  the  inner  loop  p,  q,  r  controller 
is  perfect,  which  means  that  the  desired  angular  rates  equal  the  commanded  angular  rates. 

Given  equation  (82),  which  is  now  in  the  form  of  equation  (1),  the  adaptive  dynamic  inversion  controller 
can  be  constructed  using  equations  (5)  and  (13). 


VI.  Robustness  Analysis 

Based  on  the  control  and  adaptive  laws  derived  in  the  previous  sections,  a  simulation  of  the  entire  GHV 
system  with  the  nonlinear  adaptive  nonlinear  dynamic  inversion  control  architecture  was  created  in  Simulink. 
In  order  to  make  the  simulation  more  realistic,  second-order  actuator  dynamics  with  damping  ratio  (  =  0.7 
and  natural  frequency  o;n  =  25  Hz  are  included  in  the  current  simulation,  and  position  and  rate  limits  are 
placed  on  the  control  surfaces  of  30  deg  and  100  deg/s,  respectively.  Additionally,  a  time  delay  of  0.03  s  is 
included  in  the  simulation;  however,  it  should  be  noted  that  the  simulation  can  tolerate  time  delays  of  up  to 
0.04  s  without  the  responses  becoming  significantly  oscillatory.  Commands  to  ct,  /?,  and  p  are  given  as  ramp 
signals  from  0  degrees  to  a  commanded  angle  in  fixed  time.  For  the  ct,  /?,  p  inversion  controller,  the  basis 

r  i  t 

function  /3(x;  d )  is  chosen  to  be  /3(x;  d)  =  c  a  (3  p  M  ,  where  c  is  a  constant  bias  term.  For  the  p,  g, 

r  i T 

r  inversion  controller,  the  basis  function  /3(x;d )  is  chosen  to  be  f3(x\d)  —  cpqrafdM  ,  where 
c  is  a  constant  bias  term. 

The  total  velocity  of  the  vehicle  is  controlled  using  a  PID  controller,  which  is  not  depicted  in  Figure  2. 
The  input  to  the  controller  is  the  commanded  total  velocity  of  the  GHV,  and  the  output  of  the  controller 
is  the  equivalence  ratio.  The  equivalence  ratio  is  the  fifth  control,  and  along  with  the  four  control  surfaces, 
completes  the  the  control  complement  for  the  vehicle.  Additionally,  a  saturation  limiter  has  been  added 
after  the  velocity  PID  controller  to  constrain  the  equivalence  ratio  to  be  between  0  and  1. 

In  the  derivation  of  the  adaptive  dynamic  controllers  in  Section  III,  a  reference  model  was  described.  The 
difference  between  this  reference  model  and  the  actual  system  dynamics  constitutes  the  tracking  error  of  the 
system.  In  order  to  determine  the  reference  states  of  the  system,  the  reference  signal  r  must  be  defined.  For 
the  cq  /?,  p  inversion  controller,  the  reference  signal  consists  of  the  commanded  values  of  a,  /?,  and  p.  For 
the  p,  g,  r  inversion  controller,  the  reference  signal  consists  of  the  commanded  angular  body  rates  from  the 
a,  /?,  p  inversion  controller.  Both  of  the  reference  models  have  the  general  form 

£i  0  0  x\  r]i  0  0  ri 

0  6  0  x2  +  0  r/2  0  r2  (83) 

0  0  £3_  _x3_  _  0  0  t]3_  _r3_ 

where  £1,  £2,  £3, 771, 772,  and  773  are  scalars  that  define  the  desired  time  constants  of  each  control  channel. 

The  open-loop  poles  of  the  linearized  dynamics  at  the  flight  condition  of  Mach  6  at  80,000  ft  for  both  the 
longitudinal  and  lateral-directional  states  are  shown  in  Figure  3.  It  should  be  noted  from  the  eigenvalues 
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(a)  Longitudinal.  (b)  Lateral/Directional. 

Figure  3.  Open-loop  poles  for  the  linearized  longitudinal  and  lateral/directional  dynamics. 


Table  1.  Eigenvalues  for  the  linearized  longitudinal  dynamics. 


Eigenvalue 

Damping  Ratio 

Natural  Frequency  (rad/s) 

-2.14 

1.00 

2.14 

-2.79  x  nr3 

1.00 

2.79  x  10“3 

i.25  x  icr3  ±o.mj 

-0.0113 

0.111 

1.96 

-1.00 

1.96 

Table  2.  Eigenvalues  for  the  linearized  lateral/directional  dynamics. 


Eigenvalue 

Damping  Ratio 

Natural  Frequency  (rad/s) 

-6.10 

1.00 

6.10 

2.22  x  KT16 

-1.00 

2.22  x  10“16 

0.088 

-1.00 

0.088 

5.96 

-1.00 

5.96 
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listed  in  Tables  1  and  2  that  both  the  longitudinal  and  lateral-directional  states  have  several  eigenvalues  in 
the  right-half  plane,  which  indicates  that  the  GHV  is  an  unstable  vehicle.  A  nonlinear  adaptive  dynamic 
inversion  controller  will  be  able  to  suppress  the  unstable  dynamics  and  replace  them  with  desired  dynamics 
for  the  aircraft. 

Figures  4  and  5  show  representative  simulation  results  with  the  nonlinear  adaptive  dynamic  inversion 
control  architecture  for  the  commands  a  =  ±2  deg,  (3  =  0  deg,  and  /i  =  70  deg.  The  responses  are  well- 
behaved,  and  the  control  architecture  is  able  to  achieve  the  desired  tracking  performance  without  excessive 
control  effort. 


P,  Q,  R  U,  V,  W 


Figure  4.  State  responses  for  the  commands  a  =  ±2  deg,  (3  =  0  deg,  and  =  70  deg. 

A  robustness  analysis  was  performed  via  simulation  on  the  designed  adaptive  nonlinear  dynamic  inversion 
control  architecture  from  the  previous  section  in  order  to  determine  the  maximum  tolerable  uncertainties 
in  selected  parameters  for  the  control  architecture.  Uncertainties  in  the  plant  examined  in  the  analysis 
include  the  additive  uncertainties  A Cma ,  A CUf3 ,  and  A Cm  and  multiplicative  gains  D  on  the  control  surface 
deflections,  given  in  terms  of  equations  as 


Cm  ~  Cmbaseline  +  A Cm^a 

(84) 

Cn  ~  CjihaseUne  T  A Cn^f3 

(85) 

Cm  =  C jyiba selin e  ^Cm 

(86) 

Cs  =  DCSo . 

(87) 

The  criteria  for  determining  the  bounds  on  the  uncertainties  is  that  the  states  must  not  demonstrate  oscil¬ 
latory  behavior. 

Tables  3  and  4  provide  the  maximum  and  minimum  values  of  the  additive  uncertainties  A Cma  and  A CUf3 
for  various  <a,  /?,  and  fi  commands.  It  should  be  noted  that  an  examination  of  the  maximum  and  minimum 
baseline  values  of  Cma  and  CUf3  show  that  these  values  are  on  the  order  of  10-4,  which  is  typical  of  a 
high-speed  vehicle.  The  maximum  and  minimum  values  for  A Cma  and  A CU(3  in  Tables  3  and  4  are  on  the 
order  of  10-4  —  10-3,  and  therefore,  the  control  architecture  is  able  to  withstand  considerable  uncertainties 
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Figure  5.  Control  and  adaptive  weight  responses  for  the  commands  a  =  zb 2  deg,  (3  =  0  deg,  and  ii  =  70  deg. 


in  Cma  and  CUf3  and  maintain  stable  tracking  flight.  Similar  results  for  the  additive  uncertainty  A Cm  can 
be  found  in  Table  5. 

Table  6  contains  the  minimum  allowable  multiplicative  gains  D  on  the  control  surface  deflections  for 
various  a,  (3 ,  and  fi  commands.  These  gains  represent  a  loss  of  control  effectiveness  for  one  or  more  of  the 
control  surfaces  on  the  GHV.  For  all  cases,  the  vehicle  is  able  to  tolerate  low  values  of  control  effectiveness, 
which  shows  that  the  control  architecture  is  robust  to  loss  of  control  effectiveness. 


Table  3.  Additive  uncertainty  A Cma  over  a  30  s  period  with  0.03  s  time  delay. 


a  (deg) 

P  (deg) 

A  (deg) 

max  ACm(x 
(deg-1) 

min  ACm(x 
(deg-1) 

5 

0 

0 

0.0005 

-0.0013 

5 

1 

20 

0.0003 

-0.0011 

It  should  be  noted  that  following  this  preliminary  analysis  of  the  nonlinear  adaptive  dynamic  inversion 
control  architecture,  pseudo-control  hedging  ([17], [18])  was  added  to  the  simulation  in  order  to  protect 
the  nonlinear  adaptive  dynamic  inversion  control  architecture  during  periods  of  control  surface  saturation. 
However,  for  most  of  the  cases  simulated  for  the  GHV  for  this  paper,  control  surface  saturation  was  not 
encountered. 


VII.  Reference  Trajectory  Generation 

While  the  tracking  of  <a,  /3,  and  fi  was  achieved  as  demonstrated  in  Section  VI,  it  was  desired  that  the 
GHV  have  the  ability  to  track  a  realistic  trajectory  instead  of  selected  commands.  In  order  to  control  flight 
path  angle  7  as  opposed  to  <a,  a  nonzero  setpoint  (NZSP)  controller  ([19],  [20])  was  designed  to  generate 
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Table  4.  Additive  uncertainty  A CUf3  over  a  30  s  period  with  0.03  s  time  delay. 


a  (deg) 

P  (deg) 

M  (deg) 

max  A  CUf3 
(deg-1) 

min  A  Cn/3 
(deg-1) 

0 

1 

0 

0.007 

-0.003 

5 

0 

20 

0.01 

-0.004 

5 

1 

20 

0.006 

-0.003 

Table  5.  Additive  uncertainty  A Cm  over  a  30  s  period  with  0.03  s  time  delay. 


a  (deg) 

P  (deg) 

M  (deg) 

max  A  Cm 

min  ACm 

5 

0 

0 

0.0005 

-0.003 

5 

1 

20 

0.0005 

-0.002 

Table  6.  Multiplicative  gains  D  on  control  surface  deflection  terms  over  a  30  s  period  with  0.03  s  time  delay. 


a  (deg) 

P  (deg) 

M  (deg) 

Dh,r 

^Sf,i 

DSt,r 

Dst,i 

5 

0 

0 

1 

0.14 

l 

1 

5 

0 

0 

1 

1 

l 

0.01 

5 

0 

0 

0.15 

0.15 

l 

1 

5 

0 

0 

1 

1 

0.15 

0.15 

5 

0 

20 

1 

0.31 

1 

1 

5 

0 

20 

1 

1 

1 

0.01 

5 

0 

20 

0.21 

0.21 

1 

1 

5 

0 

20 

1 

1 

0.30 

0.30 

5 

1 

20 

1 

0.42 

1 

1 

5 

1 

20 

1 

1 

1 

0.05 

5 

1 

20 

0.38 

0.38 

1 

1 

5 

1 

20 

1 

1 

0.38 

0.38 
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trajectories  for  the  GHV  to  follow.  The  NZSP  controller  requires  a  linear  model,  so  the  nonlinear  GHV  plant 
model  was  linearized  about  a  flight  condition  specified  by  the  Mach  number  and  altitude.  Assuming  that  the 
vehicle  remains  wings-level  during  its  flight  of  the  trajectory,  only  the  longitudinal  dynamics  model  will  be 

required  for  the  trajectory  generation.  For  the  NZSP  controller,  the  longitudinal  states  are 

l  T 

and  the  controls  are 


i  T 


u  9  q  a 


where  St  represents  the  equivalence  ratio  control,  and  Se  represents  the 

elevator  control,  expressed  in  terms  of  the  GHV  controls  as  Se  =  (£/>  +  £/,/)/ 2.  The  outputs  ym  to  be 
commanded  by  the  NZSP  controller  are  velocity  u  and  flight  path  angle  7,  which  can  be  expressed  in 
matrix- vector  form  as 


Vm 


u 

_7_ 

1  0 
0  -1 


St 

S„ 


(88) 


By  fitting  a  polynomial  to  the  trajectory  generated  for  7,  and  finding  the  derivative  of  that  polynomial, 
the  reference  model  for  7  is  completely  defined  for  the  GHV  simulation.  In  order  to  implement  the  7,  /?,  p 
inversion  controller,  the  dynamic  equation  for  7  must  be  derived  and  written  in  the  form  of  equation  (1)  as 


7  =  f(s)+g(s) 


(89) 


where  s  represents  the  states  in  the  GHV  simulation.  The  equation  for  7  is  derived  using  the  same  process 
that  was  applied  to  find  /i  in  Section  V.  Starting  from  equation  (79),  and  taking  its  inverse,  the  equation 
for  7  is  determined  to  be 

7  =  D(p  —  /3Sa)  +  E(q  —  a)  +  F(r  +  $Ca)  (90) 


where 


D  =  CaSpC^ 


SaS „ 


F  = 


E  =  CpC^ 

CjSgC^2  -  CaSa  (Sg2  +  1)  +  ( Sa2S  *  -  1)  Sg 


(91) 

(92) 

(93) 


CaSpSp  +  SaC 

Consider  the  equations  for  /3  and  a  in  equations  (77)  and  (78),  respectively  to  have  the  following  form 

P  =  fp  +  (P$a  ~  rCa )  (94) 


OL  =  fa  +  (- pCaTf3  +  q-  rSaTp )  (95) 

where  fp  and  fa  represent  the  terms  in  /?  and  b,  respectively,  that  do  not  depend  explicitly  on  the  angular 
rates  p,  g,  and  r.  Substituting  equations  (94)  and  (95)  into  equation  (90)  and  simplifying  the  expression 
gives  the  resulting  equation  for  7  that 


7  =  -DfpSa  -  Efa  +  FfpCa 

4  p(D  DSa2  +  ECaTg  +  FCaSa) 

+  q(  0) 

+  r  ( DCaSa  +  ESaT0  +  F-  FCa 2) . 


Note  that  there  is  no  dependence  on  q  in  the  equation  for  7.  Consequently,  when  the  dynamic  equations  for 
/?,  7,  and  fi  are  expressed  in  the  form 


'P 

P 

7 

=  f(s)+g{s ) 

Q 

A 

r 

(97) 
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where  s  represents  the  states  of  the  GHV,  the  resulting  expression  for  g(s)  is 


9(s) 


Sa  0  -Ca 

{D  -  DSa2  +  ECaTp  +  FCaSa)  0  (. DCaSa  +  ESaT0  +  F-  FCa 2) 

sec  (P)Ca  0  sec  (/3)Sa 


(98) 


As  can  be  seen  in  equation  (98),  g(s)  is  not  invertible,  which  causes  a  problem  with  the  computation  of  p,  g, 
and  r  in  the  new  7,  /?,  p  inversion  block.  If  g(s)  cannot  be  inverted,  then  the  commands  for  p,  g,  and  r  cannot 
be  determined  for  the  inversion  controller.  Therefore,  substituting  the  equation  for  7  for  the  equation  for  a 
in  the  cq  /?,  p  inversion  controller  in  order  to  track  a  trajectory  for  7  is  not  possible  for  the  GHV  simulation, 
and  another  method  of  including  the  7  trajectory  in  the  GHV  simulation  had  to  be  determined. 

In  order  to  allow  for  the  GHV  simulation  to  track  a  flight  path  angle  trajectory,  a  method  from  Reference 
[21]  was  applied  in  which  the  equation  for  ft,  where  ft  represents  the  altitude  of  the  aircraft,  is  written  in  the 
form 


h  =  fh(s)  +gh(s) 


P 

q 


(99) 


r 


Given  the  equation  for  ft 


h  =  V  ( CpCaSe  -  SpS+Ce  -  CpSaCpCe) , 


where  V  is  the  total  velocity  of  the  vehicle,  the  equation  for  ft  is  determined  to  be 


(100) 


ft  = 


b0V  +  bi$  +  b2a 


a0  a  1  a2 


P 

Q 


r 


(101) 


where 

ao  =  64 

aj  =  bsC(f)  +  b^SffyTo 

a2  =  b^C^To  —  bsS^ 

and 

b0  =  CpCaSo  -  SpS+Ce  -  CpSaC^Co 
bi  =  V  (SpCaSe  -  CpSpCe  +  SpSaC^Ce) 
b2  =  V  (~CpSaSe  -  CpCaCpC0) 
b3  =  V  (CpCaCe  +  SpS+Se  +  CpSaC^Se) 
bA  =  V  (-SpCpCe  +  CpSaS+Ce) . 

Because  h  has  a  nonzero  coefficient  for  g,  which  means  that  the  term  g(s)  in  equation  (97)  is  invertible, 
the  equation  for  h  can  replace  the  equation  for  a  in  equation  (82)  for  the  cq  /?,  p  inversion  controller.  The 
original  reference  trajectory  that  is  generated  for  7  using  the  NZSP  controller  can  be  converted  to  h  using 
the  relation  from  aircraft  kinematics  that  h  =  VS1.  Once  a  polynomial  is  fitted  to  the  new  trajectory  for 
h,  and  the  derivative  of  that  polynomial  is  determined,  the  reference  model  is  defined  for  h.  The  ft,  [3 ,  p 
inversion  controller  replaces  the  original  cq  /?,  p  inversion  controller  in  the  GHV  simulation,  and  now  desired 
trajectories  for  7  can  be  tracked. 


VIII.  Simulation  Results 

The  GHV  simulation  with  the  ft,  /?,  p  inversion  controller  was  implemented  in  Simulink.  A  trajectory 
for  7  was  generated  at  the  flight  condition  of  Mach  6  at  80,000  ft.  Figures  6,7,  and  8  present  representative 
simulation  results  for  the  flight  path  angle  trajectory  shown  in  Figure  6.  It  should  be  noted  that  in  this 
particular  simulation,  a  command  of  /3  =  —4°  also  is  given  to  the  GHV.  The  nonlinear  adaptive  dynamic 
inversion  control  architecture  achieves  the  desired  tracking  performance  for  the  generated  flight  path  angle 
trajectory  and  sideslip  angle. 
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A  simplified  inlet  unstart  model  was  added  to  test  the  ability  of  the  nonlinear  adaptive  dynamic  inversion 
control  architecture  to  handle  tracking  a  trajectory  during  an  inlet  unstart.  For  the  simulation,  an  inlet 
unstart  is  triggered  at  a  specified  time,  and  the  loss  of  thrust  and  changes  to  aerodynamic  parameters 
following  the  unstart  are  modeled  as  instantaneous  changes.  The  coefficient  of  the  axial  force  (Ca)  is 
increased  slightly,  and  the  coefficient  of  the  normal  force  (Cjv)  is  decreased  slightly.  Additive  variations  in 
Cma  and  CU(3  are  included  in  the  plant  through  the  equations 

Cm  =  Crnbaselirie  +  A  Crriaa  (102) 

Cn  =  Cnhaselirie  +  A  Cnpf3.  (103) 

Through  a  robustness  analysis,  it  was  determined  that  the  maximum  destabilizing  additive  variations  in  Cm(x 
and  CUf3  that  the  nonlinear  adaptive  dynamic  inversion  control  architecture  could  tolerate  were  A Crricx  = 
0.001  deg-1  and  A CUf3  =  —0.001  deg-1. 

Figures  9,  10,  and  11  show  the  results  for  the  GHV  simulation  during  flight  path  angle  tracking  with  an 
inlet  unstart  that  occurs  at  time  t  =  10  seconds.  It  should  be  noted  in  Figure  11  that  while  the  equivalence 
ratio  is  commanded  to  its  maximum  value  following  the  inlet  unstart,  thrust  is  not  being  generated  by  the 
vehicle  after  time  t  =  10  seconds.  While  tracking  performance  is  somewhat  degraded,  the  aircraft  is  still 
able  to  nominally  track  the  specified  flight  path  angle  trajectory. 


Flight  Path  Angle 


Figure  6.  Flight  path  angle  response  compared  with  the  generated  flight  path  angle  trajectory.  The  subscript 
p  represents  the  flight  path  angle  computed  from  the  polynomial  fit  of  h. 


IX.  Conclusions 

Because  the  dynamic  equations  for  the  GHV  are  inherently  nonlinear  and  the  aerodynamic  and  control 
derivatives  for  the  aircraft  have  significant  uncertainty  associated  with  them,  a  nonlinear  adaptive  dynamic 
inversion  control  architecture  was  selected  as  the  preferred  control  architecture  to  stabilize  and  control  the 
aircraft.  Based  on  the  simulation  results  and  the  robustness  analysis,  it  can  be  seen  that  the  objective  of 
designing  a  control  architecture  that  is  robust  in  order  to  achieve  desired  tracking  performance  was  achieved 
for  the  GHV.  The  control  architecture  is  robust  to  decreases  in  control  surface  effectiveness,  changes  in 
system  parameters,  and  time  delays  of  0.04  s  or  less.  The  responses  for  tracking  generated  flight-path  angle 
trajectories  are  well  behaved,  and  the  necessary  control  effort  for  tracking  is  not  excessive.  Additionally,  the 
control  architecture  is  able  to  tolerate  an  inlet  unstart  and  maintain  nominal  tracking  of  a  specified  flight 


20 


Copyright  ©  2013  by  the  American  Institute  of  Aeronautics  and  Astronautics,  Inc.  All  rights  reserved. 


control  surface  deflection  (deg) 


P,  Q,  R  U,  V,  W 


Figure  7.  State  responses  for  the  generated  flight  path  angle  trajectory. 
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Figure  8.  Control  and  adaptive  weight  responses  for  the  generated  flight  path  angle  trajectory. 
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Flight  Path  Angle 


Figure  9.  Flight  path  angle  response  compared  with  the  generated  flight  path  angle  trajectory  during  an  inlet 
unstart.  The  subscript  p  represents  the  flight  path  angle  computed  from  the  polynomial  fit  of  h. 


Figure  10.  State  responses  for  the  generated  flight  path  angle  trajectory  during  an  inlet  unstart. 
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Figure  11.  Control  and  adaptive  weight  responses  for  the  generated  flight  path  angle  trajectory  during  an 
inlet  unstart. 


path  angle  trajectory.  Therefore,  it  can  be  concluded  that  this  approach  of  nonlinear  adaptive  dynamic 
inversion  control  works  well  as  a  control  architecture  for  the  GHV. 
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Autonomous  soaring  is  a  concept  in  which  the  endurance  of  unmanned  aircraft  can  be 
increased  by  exploiting  wind  updrafts.  Recent  research  has  explored  traditional  feedback 
control  methods  for  autonomous  navigation  of  vehicles  to  thermal  updrafts.  This  paper 
develops  an  approach  for  planar  lateral/directional  guidance  of  a  linear  dynamic  gliding 
aircraft  to  a  known  thermal  location.  Reinforcement  learning  is  utilized  to  generate  refer¬ 
ence  bank  angle  commands  for  directing  the  aircraft  to  close  proximity  of  the  updraft,  and 
from  there  the  aircraft  follows  a  circling  trajectory  centered  on  the  thermal  to  gain  energy. 
A  Lyapunov-based  feedback  control  law  is  used  to  generate  bank  angle  commands  when 
circling  the  thermal.  By  using  reinforcement  learning  the  problem  of  online  trajectory  gen¬ 
eration  is  reduced  to  a  simple  search  in  a  static  state-action  value  table.  This  approach  has 
the  advantage  of  low  computational  burden/overhead  in  practice.  Furthermore,  the  need 
for  a  precise  aircraft  model  for  learning  and  simulation  is  reduced.  Monte  Carlo  results 
presented  in  the  paper  demonstrate  that  the  reinforcement  learning  guidance  agent  can 
consistently  navigate  the  aircraft  to  the  thermal.  Reliable  navigation  is  achieved  after  a 
relatively  small  number  of  learning  episodes.  An  analysis  of  typical  energy  gains  circling 
a  thermal  of  constant  shape  and  size  is  also  presented.  These  results  indicate  that  the 
approach  is  a  suitable  candidate  for  autonomous  soaring. 


Nomenclature 


/ 3  aircraft  sideslip  angle 

p  aircraft  roll  rate 

r  aircraft  yaw  rate 

<p  aircraft  bank  angle,  defined  as  positive  when  rolling  right 

'ip  aircraft  heading  angle,  defined  as  positive  when  east  of  north 

5 a  aileron  deflection  angle 

5r  rudder  deflection  angle 

Xk  value  of  the  discrete-time  variable  x  at  time  t  =  T(k  —  1),  where  T  is  the  sample  period 
a  step-size  parameter  for  reinforcement  learning 

7  discount-rate  parameter  in  reinforcement  learning 

Q*  optimal  state-action  value  function 

Q  learned,  approximate  state-action  value  function 

s  learning  state 

a  available  actions,  given  the  learning  state  s 

e  frequency  with  which  exploratory  actions  are  taken  under  e-greedy  policy 
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I.  Introduction 


Wind  velocity  gradients  in  the  atmosphere  enable  long-duration  missions  of  gliding  aircraft.  By  flying 
near  wind  updrafts,  gliders  can  gain  altitude  with  a  minimal  expenditure  of  energy.  This  process  is  referred 
to  as  “static  soaring.”  Thermals  are  one  source  of  updrafts  and  are  caused  by  uneven  heating  of  the  earth’s 
surface,  leading  to  localized  regions  of  heated  air  that  rise.  Static  soaring  is  the  process  by  which  a  gliding 
vehicle  gains  altitude  (and  ultimately  energy)  by  flying  through  or  circling  an  updraft.  Thermals  are  the  most 
common  updraft  source  for  manned  gliders  and  have  been  exploited  since  the  early  20th  century  to  extend 
the  range  and  endurance  of  gliding  flight.2  Glider  pilots  may  identify  thermals  by  noticing  the  formation  of 
cumulus  clouds,  the  composition  of  the  local  terrain,  or  birds  circling  the  updraft.  Within  the  last  fifteen 
years,  thermal  soaring  has  attracted  attention  as  means  of  extending  the  mission  duration  of  unmanned 
air  systems  (UAS).  Miniaturized  sensors  enable  small  UAS  to  accomplish  tasks  that  previously  would  have 
required  a  manned-scale  vehicle.  However,  small  UAS  lack  the  range  and  endurance  of  larger  vehicles  due  to 
the  relatively  low  energy-to-weight  ratio  of  batteries  and  motors  at  this  scale.  Thermal  updrafts  represent 
a  way  for  small  UAS  to  extend  their  flight  time,  and  by  exploiting  updrafts  a  battery-powered  electric  UAS 
can  theoretically  achieve  endurance  comparable  to  that  of  a  solar-powered  aircraft  while  supporting  a  higher 
wing  loading  and  smaller  battery  size.  Compared  to  a  solar-powered  aircraft,  the  soaring  UAS  has  greater 
flexibility  to  carry  larger  payloads  or  to  satisfy  performance  constraints. 

Most  of  the  flight  test  results  in  literature  have  performed  some  variation  on  monitoring  the  aircraft’s 
total  energy  to  infer  the  presence  of  thermals  from  local  air  currents.5,6.  Some  research  has  considered 
remote  detection  of  the  thermal  from  the  UAS  using  visual  or  infrared  (IR)  imaging.  Akhtar  conducted 
field  trials  with  an  IR  camera,  and  demonstrated  that  the  camera  could  detect  “hot  spots”  on  the  ground  as 
well  as  clouds,  both  of  which  correlate  with  the  presence  of  updrafts.  Another  work  describes  calibration  of  a 
thermal  infrared  camera  for  remote  sensing  onboard  a  UAS.  Although  thermal  soaring  was  not  considered, 
a  similar  architecture  might  be  used  for  remote  detection  and  exploitation  of  updrafts. 

A  variety  of  approaches  have  been  investigated  to  achieve  autonomous  soaring.  Ref.  4  implemented  an 
asymptotically  stable  thermal  centering  control  and  validated  the  work  in  flight  test.  Ref.  9  simultaneously 
explored  and  exploited  an  unknown  wind  field  using  a  Gaussian  Process  model  for  mapping.  A  receding 
horizon  framework  for  optimizing  energy  gain  based  on  local  knowledge  of  the  wind  field  was  implemented 
in  Ref.  10.  This  study  considered  both  thermal  soaring  and  dynamic  soaring  in  which  energy  is  absorbed 
from  local  wind  gusts.  Allen  and  Lin  used  a  guidance  and  control  algorithm  in  which  the  aircraft  estimated 
the  location  of  a  thermal  based  on  energy  measurements  and  an  assumed  thermal  profile.  Flight  test  results 
were  presented  and  showed  an  average  energy  gain  of  173  m  per  thermal  over  seventeen  flights.  In  one  case 
the  effective  endurance  was  extended  from  two  to  fourteen  hours.  Edwards  presents  flight  test  results  in 
which  an  autonomous  glider  traveled  48  km  over  1.5  hrs  from  an  initial  altitude  of  140  m.  In  this  approach 
the  local  updraft  speed  was  estimated  and  a  grid  of  nodes  was  evaluated  to  determine  the  most  likely  center 
of  a  parameterized  thermal. 

This  paper  develops  a  novel  concept  for  autonomous  navigation  using  reinforcement  learning  (RL)  to 
generate  bank  angle  commands  that  navigate  an  unmanned  glider  to  a  thermal  updraft  at  a  known  location. 
This  approach  is  distinct  from  previous  works,  which  generally  localize  thermals  using  the  energy  rate  of 
change  with  some  form  of  feedback  control  for  guidance  and  navigation.  The  RL  algorithm  used  is  Q-learning, 
in  which  an  approximation  to  the  optimal  state-action  value  mapping  is  learned  and  stored.  This  control 
scheme  has  three  primary  advantages.  First,  by  allowing  low-level  feedback  controllers  determine  the  required 
control  deflections,  the  reinforcement  learning  agent  can  be  trained  in  a  theoretically  model-agnostic  fashion, 
so  long  as  the  vehicle  can  match  the  commanded  bank  angle  sufficiently  quickly.  This  fact  should  also  make 
the  RL  guidance  law  robust  to  some  plant  uncertainties,  reducing  the  need  for  very  accurate  modelling  in 
simulation.  Second,  by  utilizing  RL  to  generate  bank  angle  commands,  the  problem  of  generating  reference 
trajectories  to  the  updraft  is  essentially  reduced  to  a  table  lookup  problem.  This  arrangement  allows  for  a 
computationally  simple  control  scheme  consisting  of  the  RL  lookup  for  navigation  with  generic  state  feedback 
command  and  hold  controllers  to  determine  control  deflections.  A  small  UAS  platform  is  assumed  to  have 
limited  onboard  capacity  to  carry  sensors  and  computing  equipment,  and  reducing  the  navigational  overhead 
frees  up  computing  power  for  other  tasks.  Furthermore,  the  navigation  component  can  be  easily  integrated 
with  existing  autopilot  hardware  and  software.  Third,  the  use  of  Q-learning  gives  a  designer  flexibility  to 
tailor  and  optimize  performance  by  shaping  rewards  appropriately. 

The  paper  is  organized  as  follows.  The  RL  algorithm  and  its  application  to  the  aircraft  guidance  problem 
are  considered.  This  is  followed  by  modelling  and  simulation  of  the  vehicle.  Results  of  a  preliminary 
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implementation  using  RL  for  lateral/directional  control  with  conventional  longitudinal  control  are  then 
presented,  and  conclusions  and  plans  for  extending  this  implementation  are  given. 

II.  Control  Policy 


II. A.  Q-learning 

Reinforcement  learning  describes  a  general  class  of  algorithms  that  attempt  to  learn  state-action  mappings 
to  maximize  a  reward  signal.  With  Q-learning,  a  decision  agent  seeks  to  learn  the  optimal  action-value 
function,  Q*(at,st),  which  is  the  value  of  taking  action  at  from  the  state  st .  The  algorithm  learns  Q,  an 
approximation  to  Q* .  For  learning,  an  e-greedy  policy  is  used.  Under  this  policy,  the  decision  agent  takes 
randomly  selected  actions  with  frequency  e,  and  takes  the  action  with  the  highest  value  with  frequency  1  —  e. 
The  value  of  e  is  varied  from  1.0  initially,  to  encourage  learning,  to  a  minimum  of  0.05  over  the  course  of 
learning.  The  decision  agent  receives  rewards  based  on  the  states  visited.  From  any  given  initial  state, 
the  goal  of  the  learning  agent  is  to  maximize  the  total  reward  received  until  a  terminal  condition  is  met.12 
Q-learning  has  several  features  that  make  it  advantageous  for  the  autonomous  soaring  problem.  Q-learning 
is  guaranteed  to  converge  to  Q*,  although  it  requires  a  theoretically  infinite  number  of  learning  episodes. 
The  convergence  to  Q*  is  also  policy-independent.  Lastly,  learning  can  be  model-free,  in  that  the  planning 
agent  does  not  require  an  explicit  model  of  the  system. 

The  Q-learning  algorithm  is  straightforward.  In  the  update  equation  for  Q(s,  a),  a  is  a  step-size  parameter 
and  7  is  a  discount-rate  parameter  that  influences  the  extent  to  which  received  rewards  influence  the  learned 
value  of  subsequent  states.  Primed  quantities  represent  first  future  value. 

1.  Q(s,a)  is  initialized  arbitrarily 

2.  For  each  episode  in  the  learning: 

(a)  Initialize  at  state  s 

(b)  For  each  step  in  the  episode: 

i.  Choose  action  a  from  s  via  policy 

ii.  Observe  the  next  state,  s' 

hi.  Q(s,  a)  Q(s ,  a)  +  a[r  +  7maxa/Q(s',  a')  -  Q(s ,  a)] 
iv.  s  <—  s' 

(c)  Break  when  s  is  a  terminal  state 

II. B.  Q-learning  for  autonomous  soaring 

Previous  work  has  demonstrated  the  use  of  Q-learning  for  navigation  of  a  powered  vehicle.  In  the  current 
work,  a  linear  dynamic  sailplane  model  is  implemented  and  the  performance  of  RL  for  lateral/direction 
guidance  is  evaluated.  Two  primary  assumptions  are  made  in  developing  the  RL  controller.  First,  the  glider 
possesses  feedback  controllers  sufficient  to  match  bank  angle  commands;  second,  the  location  of  the  thermal 
center  is  known  a  priori  by  the  vehicle.  The  first  assumption  can  be  readily  achieved  by  even  hobby-grade 
equipment.  The  second  assumption  is  somewhat  more  limiting.  Remote  detection  using  an  IR  camera 
or  other  sensor  is  the  assumed  modus  operandi.  The  use  of  RL  is  restricted  to  guidance  of  the  UAS  to 
the  proximity  of  the  thermal;  from  this  point  a  feedback-based  circling  controller  is  used  to  evaluate  the 
altitude  gain  at  the  thermal.  Longitudinal-axis  control  is  effected  by  a  state  feedback  regulator  about  a 
steady-state.  The  RL  guidance  control  is  discrete  and  is  wrapped  around  a  continuous-time  Proportional- 
Integral-Derivative  (PID)  controller  for  bank  angle  command  and  hold.  The  only  requirement  of  the  PID 
controller  is  that  it  consistently  match  the  commanded  bank  angle  faster  than  the  update  rate  of  the  RL 
guidance  algorithm,  without  destabilizing  the  system  or  driving  other  states  to  unacceptably  large  values. 
The  RL  guidance  algorithm  can  be  summarized  as  follows: 

For  each  episode: 

1.  Initialize  the  state:  r,  the  range  to  the  thermal,  A,  azimuth  to  the  thermal  relative  to  aircraft  heading, 
</>,  current  bank  angle 

2.  Loop  over  500  discrete  steps,  or  until  a  break  condition  is  met: 
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(a)  Choose  action  from  available  commanded  changes  in  bank:  A 


0  -Acf)  A  cj) 


(b)  Simulate  the  system  with  commanded  bank  angle  0  + A  for  Tq  seconds,  using  the  continuous-time 
bank  angle  and  elevator  controls. 


(c)  Observe  the  next  state.  If  r  >  where  R\  is  the  maximum  allowed  range  to  the  thermal,  receive 
a  penalty  reward  and  break.  If  r  <  R2,  where  R2  is  the  effective  radius  of  the  updraft,  receive 
a  goal  reward  rg  and  break  the  loop.  Else,  receive  no  reward,  and  continue. 


Once  learning  is  conducted  with  one  vehicle  model,  the  Q-learning  update  rate  Tq  can  be  tailored  to 
provide  guidance  for  any  vehicle,  regardless  of  its  forward  speed.  This  could  eliminate  the  need  for  re¬ 
learning  the  Q-matrix  whenever  the  plant  model  changes.  In  practice,  the  low-level  bank  angle  controller 
also  influences  performance  so  it  may  be  difficult  to  select  the  appropriate  value  of  Tq  for  any  glider.  It 
is  anticipated  that  the  Q-matrix  from  one  vehicle  can  be  used  to  “seed”  an  initial  Q-matrix  for  a  different 
vehicle,  reducing  the  overall  learning  time. 


III.  Aircraft  Modelling  and  Simulation 


III. A.  Aircraft  model 

The  dynamical  aircraft  model  is  based  on  a  Schweizer  SGS 
1-36  sailplane  (Fig.  1).  The  SGS  1-36  is  a  manned,  single¬ 
seat  advanced  trainer.  Ref.  1  gives  coefficients  for  a  linear 
model  of  an  SGS  1-36  modified  for  flight  at  high  angles  of 
attack,  using  values  derived  from  flight  test  data  in  Ref.  15. 

The  continuous-time  linear  model  and  steady-state  values 
used  in  simulation  are  given  in  Appendix  A.  The  SGS  1-36 
is  not  an  ideal  match  for  the  intended  application  of  this  RL 
controller,  as  it  is  substantially  heavier  and  faster  than  what 
would  be  considered  a  “small”  unmanned  glider.  However, 
it  is  assumed  that  a  manned  glider  is  more  representative 
of  the  intended  system  dynamics  than  a  powered  unmanned 
vehicle  model,  relatively  few  gliding  vehicle  models  are  avail¬ 
able  in  the  literature,  and  using  a  large  vehicle  allows  future 
evaluation  of  how  well  the  Q-learning  will  scale  to  a  smaller 
glider. 

III.B.  Q-learning  settings 

For  the  purposes  of  learning,  the  maximum  allowed  range  to  the  thermal  is  R\  =  1200  m.  If  the  aircraft 
flies  too  far  away,  the  learning  agent  receives  a  penalty  and  the  episode  terminates.  A  goal  reward  is  given  if 
the  aircraft  flies  within  the  effective  updraft  radius,  which  is  approximately  460  m.  The  goal  and  boundary 
rewards  are  +20  and  —20  respectively.  The  maximum  bank  angle  is  ±30°,  and  if  the  RL  agent  attempts 
to  command  a  bank  angle  outside  that  range,  another  action  is  selected  and  the  simulation  continues.  A 
timestep  of  Tq  =  5  sec  is  used  for  the  navigation  problem,  and  the  RL  agent  commands  changes  in  bank 
angle  of  Ac/)  =  5°. 

III.C.  Low-level  control  laws 

The  bank  angle  control  law  is  a  simple  continuous-time  PID  aileron  control  law  given  by  Eq.  (1).  In 
this  equation,  <fir  is  the  reference  bank  angle  generated  from  the  RL  agent,  and  is  treated  as  piecewise 
constant  between  RL  updates.  The  gains  used  in  simulation  are  Kp  =  5.5 ,  iQ  =  0.1,  =  7.0  and  the 

maximum  commanded  aileron  deflection  allowed  is  25°.  The  gains  are  selected  so  that  the  vehicle  can  match 
a  5°commanded  change  in  bank  to  with  0.01°in  five  seconds,  without  exceeding  an  aileron  deflection  of  5°. 


Figure  1.  Schweizer  SGS  1-36  sailplane  in 
flight.  Source:  “Schweizer  1-36  Photo  Collec¬ 
tion,”  Dryden  Flight  Research  Center,  Accessed 
2013/06/04,  http://www.dfrc.nasa.gov/Gallery/Photo/ 
Schweizer- 1-36/HTML/ECN- 26847 . html 
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Spid  =  —Kp{(j)  -  <t>r)  -  Ki  fo(0(t)  -  (f>r)dt  -  Kd<fr 
$max 
$max 


$max  ^  fipid  ^  $ max 

fipid  ^  fimax 
3pid  ^  fimax 


(1) 


The  elevator  control  law  is  a  simple  Linear  Quadratic  Regulator  (LQR)  state  feedback  control  law  given 
in  Eq.  (2),  where  u  is  the  perturbed  body  1-axis  speed,  a  is  the  perturbed  angle-of-attack,  q  is  the  body 
2-axis  (pitch)  angular  rate,  and  0  is  the  perturbed  pitch  attitude  angle.  The  LQR  weighting  matrices  were 
selected  to  closely  regulate  u  and  a  and  tuned  to  produce  well-damped  convergence  to  the  steady-state  in 
the  presence  of  the  external  updraft  during  circling  flight. 


0.1416  1.925  -0.4718 


-1.479 


a 

Q 

0 


(2) 


III.D.  Thermal  model 


The  thermal  updraft  model  is  taken  from  Ref.  11.  In  the  following  equations,  re*  and  Zi  are  convective-layer 
scale  parameters.  Updraft  velocity  is  calculated  using  Eq.  (3),  in  which  z  is  the  height-above-ground  altitude 
of  the  aircraft.  Note  that  in  the  convention  used  in  Eq.  (3)  z  is  defined  as  positive  up. 


wT  =  w* 


(i-LiJ) 

^ i 


Updraft  diameter  D  increases  exponentially  with  increasing  altitude: 


(3) 


D  =  0.203  (  —  )  3  (1  —  0.25— )zj  (4) 

Note  that  in  11,  conservation  of  mass  is  used  outside  the  updraft  to  calculate  a  downdraft  velocity.  For 
the  purposes  of  simulation  downdraft  velocity  in  this  study  is  assumed  negligibly  small.  This  updraft  model 
produces  a  radially  symmetric  thermal.  The  updraft  parameters  used  are:  w*  =  2.56  ™  and  Z{  —  660  m, 
and  the  updraft  diameter  is  scaled  so  that  the  updraft  radius  at  a  300  m  altitude  is  approximately  460  m. 
The  updraft  is  applied  as  a  disturbing  input  on  the  aircraft  u ,  <a,  and  /3  channels.  The  question  of  whether 
the  energy  gained  from  circling  a  thermal  is  worth  the  energy  cost  of  flying  to  it  is  not  considered  here. 
Consequently,  when  thermal  circling  effectiveness  is  examined,  an  updraft  of  constant  size  and  strength  is 
used  in  all  simulations.  Since  evaluation  of  the  circling  controller  is  not  the  primary  objective,  the  updraft 
size  is  effectively  arbitrary.  Two  factors  are  considered  in  sizing: 

1.  The  updraft  radius  is  chosen  to  be  wide  enough  that  the  SGS  1-36  model  can  perform  steady  level 
turns  at  a  bank  angle  of  30°or  less  at  an  altitude  of  300  m. 


2.  the  updraft  height  is  chosen  so  that  most  simulations  using  the  circling  feedback  control  law  will 
experience  a  net  loss  in  energy  if  circling  continues  for  1000  seconds  or  less. 


The  purpose  of  the  circling  feedback  control  law  is  to  establish  a  reasonable  performance  baseline  against 
which  an  RL-based  circling  controller  can  be  evaluated.  For  numerical  convenience,  a  relatively  weak  thermal 
is  selected  so  that  the  total  time  the  aircraft  can  circle  before  it  begins  losing  energy  can  be  compared  using 
relatively  short  simulation  times. 
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III.E.  Thermal  circling  control  law 


A  feedback  control  law  is  used  to  create  baseline  circling  results 
against  which  a  later  RL-based  controller  can  be  evaluated. 

The  objective  of  this  controller  is  only  to  provide  a  baseline  of 
reasonably  good  performance,  with  no  consideration  of  opti¬ 
mizing  any  performance  metric.  Since  the  baseline  controller 
need  not  be  implemented  in  experiment,  real-time  computa¬ 
tional  tract  ability  is  largely  ignored.  A  Lyapunov-based  feed¬ 
back  control  law  is  designed  that  guarantees  convergence  to  a 
reference  trajectory  under  the  assumption  of  constant  speed 
level  turns.  This  approximation  is  found  to  yield  acceptable 
performance. 

The  coordinate  system  referenced  for  the  circling  control  is 
shown  in  Fig.  2.  An  inertial  north-east-down  reference  frame 

is  used.  The  variable  to  be  controlled  is  77,  the  angle  between  Figure  2.  Coordinate  system  used  for  circling 
the  aircraft  X-Y  plane  velocity  vector  V  and  the  normal  to  contro1  law-  Values  are  positive  as  shown, 
the  X-Y  plane  position  vector  e7.  77  can  be  written  in  terms  of  the  position  vector  r  and  velocity  vector  as 
in  Eq.  (7): 


r  =  rer 

(5) 

dr 

(6) 

dt 

=  V  =  rer  +  rye7 

r 

tan  77  =  — 

(7) 

r  7 

Defining  the  error  coordinate  =  77  —  r\ref  in  terms  of  77  and  a  reference  angle  ?7re/,  a  stabilizing  control 
law  can  be  developed  beginning  with  the  Lyapunov  function  V  =  ^e2.  Its  time  rate  and  a  stabilizing 
controller  are  given  by  Eqs.  (8)  and  (9). 


Vz*en 


rr 7  ■ 


•  9  • 

•  r  7  • 


.  rrr y 


Rl^t]  — 


+  (ry)2 
■  r2  7  - 


Vref 


■  rr 7 


y.2 


+  (rj)2 


Vref 


The  second  derivative  of  the  X-Y  position  vector  is: 


(8) 

(9) 


cPr 
dt 2 


(r  —  ry2)er  +  (2ry  +  7"7)e7 


(10) 


Under  the  assumption  of  steady,  level,  1  g  turns  the  aircraft  acceleration  vector  is  —  gtan(/>er.  Under 
this  assumption  the  second  derivatives  of  r  and  7  are  given  by  Eqs.  (11)  and  (12).  Under  the  further 
assumption  that  the  magnitude  of  the  velocity  V  is  approximately  the  steady-state  forward  speed  C/i,  the 
continuous-time  control  law  used  to  generate  reference  bank  angles  is  given  by  Eq.  (13): 


g  tan  0  =  7*7 


2 


r  =  ry2  —  g  tan 
— 2ry 


U? (flref  -  KlCtj)  ~  r2j 


(11) 

(12) 

(13) 


Reference  trajectories  are  generated  by  computing  polynomial  fits  for  r( 7)  that  connect  an  initial  point 
ro,7o  to  a  final  point  on  the  desired  circular  trajectory  at  70  +  7 r.  Continuity  of  r  and  its  first  derivative  at 
the  initial  state  are  enforced  along  the  reference,  and  the  target  state  is  r  =  Rd,r  =  0.  Starting  from  the 
controller  used  in  Ref.  11,  the  target  radius  is  eventually  selected  to  be  Rd  =  O.6R2,  with  R2  being  the  local 
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updraft  outer  radius.  New  trajectories  are  computed  every  half-orbit  of  the  updraft  and  the  aircraft  is  found 
to  converge  to  a  circular  trajectory  within  one  or  two  orbits  typically.  The  feedback  gain  Kl  =  0.25  gives 
acceptable  results. 

III.F.  Simulation  framework 

Fig.  3  shows  the  coordinate  system  used  for  learning  and  its  relation  to  the  dynamical  system  coordinates 
used.  Inertial  position  is  given  by  a  north-east-down  reference  frame  centered  at  the  thermal  at  ground  level. 
A  standard  body-fixed  reference  frame  is  attached  to  the  aircraft  with  the  1-axis  out  the  nose,  the  2-axis 
along  the  right  wing,  and  the  3-axis  down.  The  body  frame  is  related  to  the  inertial  frame  via  the  standard 
3/2/1  Euler  angle  rotation  sequence  through  ip/O/cp. 

The  variables  r  and  7  are  polar  coordinates  for  the  X-Y  plane  position  vector  of  the  vehicle.  Note  that 
r  is  the  length  of  the  projected  position  vector  and  is  always  positive.  A  is  the  angle  between  the  X-Y  plane 
projection  of  the  aircraft  body  1-axis  and  the  vector  from  the  aircraft  to  the  thermal;  A  is  positive  when  the 
thermal  is  out  the  right  wing.  A  is  related  to  heading  angle  ^  and  7  by  Eq.  (14): 


7r  Y  7  —  A  =  ip 


(14) 


For  learning  the  Q-matrix  in  the  navigation  problem,  the  glider  is  initialized  with  uniformly  distributed 
random  states  500  m  <  r  <  700  m,  —135°  <  A  <  135°  and  zero  bank  angle.  The  inertial  state  is  always 
initialized  at  Z  =  —300  m  and  at  arbitrary  Y,  Y,  ip  coordinates  that  satisfy  the  relative  placement  specified 
by  A.  All  other  aircraft  perturbed  states  are  assumed  zero. 

IV.  Numerical  Results 

The  RL  navigation  agent  is  trained  on  100,000  episodes  using  an  e-greedy  policy.  100,000  episodes  was 
found  to  be  a  sufficient  number  to  achieve  acceptable  performance.  The  RL  agent  was  evaluated  in  1,000 
Monte  Carlo  simulations,  each  of  which  terminates  when  the  aircraft  either  flies  within  the  updraft  radius  R2 
or  outside  the  boundary  Ri,  or  after  500  discrete  timesteps.  Subsequently,  the  circling  control  law  is  evaluated 
in  1,000  simulations,  each  starting  at  the  final  state  of  an  RL  evaluation  episode.  Results  are  presented  in 
two  sections.  The  first  section  displays  performance  of  the  RL  navigation  agent  with  representative  episodes 
and  analyzes  the  Monte  Carlo  results.  In  the  second  section  the  baseline  circling  performance  is  considered. 

IV. A.  RL  navigation  Monte  Carlo  results 

For  Monte  Carlo  simulations,  the  set  of  initial  conditions  is  expanded  to  250  m  <  r  <  700  m,  —15°  <  (p  <  15° 
with  —135°  <  A  <  135°  as  before.  A  graphical  summary  of  the  success  and  failure  cases  as  a  function  of 
initial  conditions  is  shown  in  Fig.  5.  In  1,000  simulations  the  RL  agent  correctly  navigates  to  the  thermal  in 
962  of  the  cases.  The  cases  in  which  the  agent  failed  mostly  correspond  to  nonzero  initial  bank  angles  at  or 
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near  extrema  of  range  and  azimuth.  The  aircraft  presumably  visited  these  states  infrequently  if  at  all  during 
learning,  so  failure  is  predicated  on  whether  the  initial  conditions  naturally  drive  the  aircraft  towards  or 
away  from  states  the  RL  agent  has  learned.  Since  the  navigation  agent  is  successful  in  a  high  percentage  of 
cases,  it  is  reasonable  to  expect  that  with  additional  learning  the  agent  could  consistently  reach  the  thermal 
from  this  expanded  set  of  initial  conditions.  It  is  important  to  note  that  when  the  agent  is  evaluated  on  1,000 
episodes  using  the  same  initial  conditions  used  for  learning  in  Sec.  III.F,  the  agent  succeeds  in  all  cases. 


The  altitude  lost  in  navigating  to  the  thermal  in 
Monte  Carlo  simulations  is  shown  in  Fig.  4.  This  is 
used  as  a  performance  metric.  Note  that  the  cases 
in  which  the  UAS  did  not  reach  the  thermal  are  not 
shown.  The  distribution  exhibits  two  distinct  modes 
for  altitude  loss  of  less  than  100  m.  There  are  also 
26  flights  with  distinctly  greater  altitude  loss  than 
the  others.  The  latter  group  demonstrate  inefficient 
flight  paths  in  which  the  vehicle  typically  rotates 
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Figure  4. 
navigation. 

through  a  full  2tt  rotation  in  heading  before  reaching  the  thermal.  This  behavior  could  be  caused  by  the  use 
of  new  initial  conditions  in  Monte  Carlo,  or  could  simply  have  occurred  because  the  RL  reward  structure  does 
not  give  any  preference  to  short  paths.  The  bimodality  of  the  remainder  of  the  distribution  is  unexpected, 
since  the  initial  conditions  were  chosen  uniformly.  The  bimodal  distribution  is  presumed  to  be  either  an 
effect  of  not  using  a  time-optimal  reward  structure  or  a  natural  consequence  of  the  dynamics  of  the  problem. 

Plots  of  ground  tracks  and  bank  angle  histories  for  three  simulations  are  shown  in  Fig.  6.  The  initial 
conditions  shown  were  selected  arbitrarily  from  the  set  of  Monte  Carlo  initial  conditions,  for  the  purpose  of 
showing  typical  histories  only. 


(a)  Initial  range  versus  azimuth. 


(b)  Initial  bank  angle  versus  range. 


Figure  5.  Summary  results  of  Monte  Carlo  success/failure  versus  initialization  conditions. 


IV.B.  Baseline  circling  results 

The  thermal  circling  baseline  results  are  generated  by  evaluating  1,000  simulations.  Each  simulation  starts 
at  the  final  state  of  a  Monte  Carlo  navigation  episodes,  using  the  fixed  thermal  described  in  Sec.  III.D.  The 
thermal  is  deliberately  weak  for  this  scale  of  vehicle  for  numerical  convenience  in  future  comparison  with  an 
RL  circling  controller.  The  circling  implementation  does  not  stop  circling  at  peak  altitude;  for  simplicity, 
state  histories  are  simulated  for  1,000  seconds  and  the  aircraft  continues  circling  until  either  the  time  runs 
out  or  the  aircraft  hits  the  ground.  Results  are  evaluated  in  two  figures.  First,  the  time-of- flight  before 
the  current  kinetic  and  potential  energy  normalized  by  mass  is  less  than  its  initial  value  is  computed.  A 
histogram  of  time-of-flight  is  shown  in  Fig.  7(a).  Note  that  in  51  cases,  the  aircraft  energy  was  still  increasing 
after  1000  seconds.  The  remainder  of  the  cases  are  significantly  right  skewed,  having  a  median  of  118.2  and 
mean  of  151.3  sec. 
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Initial  conditions:  r  =  700  m,  A  =  -1 35  deg. 


Initial  conditions:  r  =  250  m,  A.  =  -90  deg. 


Initial  conditions:  r  =  350  m,  A  =  -45  deg. 


(a)  Case  1 


(b)  Case  2 


(c)  Case  3 


Figure  6.  Representative  X-Y  position  and  bank  angle  histories  showing  RL  commands.  Initial  position  is  indicated 
by  an  open  circle  and  final  position  is  indicated  by  a  circled  dot. 


Histogram  of  final  time  with  net  positive  energy  (sec) 
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(a)  Histogram  of  circling  time-of-flight  for  which  altitude 
remains  greater  than  altitude  when  circling  started. 


(b)  Scatterplot  of  peak  altitude  versus  initial  altitude  for 
circling. 


Figure  7.  Thermal  circling  results. 


Second,  peak  altitude  is  plotted  as  a  function  of  initial  altitude  in  Fig.  7(b).  For  a  majority  of  cases, 
the  relationship  appears  to  be  roughly  linear.  However,  there  are  approximately  140  cases  that  exhibit 
very  different  behavior.  These  cases  are  scattered  roughly  uniformly  between  initial  altitudes  of  210  and  290 
meters  and  have  peak  altitudes  bounded  above  by  about  650  meters  and  bounded  below  by  the  approximately 
linear  behavior  of  the  other  data.  While  this  second  set  does  appear  to  contain  the  51  cases  in  which  energy 
had  not  yet  peaked,  there  are  still  90  cases  for  which  the  cause  of  differing  behavior  has  not  been  explained. 
Additional  analysis  did  not  determine  any  trend  between  initial  conditions  and  peak  altitude  gain  which 
might  be  explain  the  very  different  results  for  these  90  cases.  Despite  this  somewhat  unexpected  behavior, 
the  circling  controller  provides  a  baseline  against  which  future  RL-based  circling  can  be  evaluated.  The 
mean  difference  in  initial  and  maximum  altitude  is  50.1  m  with  a  standard  deviation  of  69.9  m. 

V.  Conclusions 

This  paper  presents  the  development  and  initial  simulation  results  for  reinforcement  learning-based  navi¬ 
gation  of  a  gliding  vehicle  to  reach  a  remotely  detected  thermal  updraft.  This  approach  has  the  advantage  of 
low  computational  overhead  in  practice.  It  is  model-agnostic  if  certain  conditions  are  met  so  learning  should 
be  transferable  between  different  vehicles.  By  using  Q-learning,  a  significant  flexibility  to  tailor  performance 
via  reward  shaping  is  introduced.  Numerical  simulation  is  performed  using  a  simulated  Schweizer  SGS  1-36 
glider  dynamic  model  with  a  constant  updraft.  Monte  Carlo  results  show  that  the  RL  guidance  agent  navi¬ 
gates  to  the  updraft  in  100%  of  cases  when  initialized  at  the  same  states  used  to  initialize  learning.  When 
the  set  of  initial  conditions  is  expanded  to  include  initial  bank  angles  and  ranges  not  used  in  learning,  the 
navigation  agent  succeeds  in  96.2%  of  Monte  Carlo  cases.  This  performance  is  achieved  after  a  relatively 


9  of  ll 


American  Institute  of  Aeronautics  and  Astronautics 


small  number  of  learning  episodes  (100,000).  A  baseline  analysis  of  altitude  and  flight  time  gains  from  cir¬ 
cling  the  thermal  is  performed  using  a  Lyapunov-based  circling  law.  This  control  law  establishes  a  baseline 
mean  altitude  gain  of  50.1  m. 

For  future  work,  four  primary  areas  for  further  study  have  been  identified.  First,  it  is  of  interest  to 
compare  the  RL  guidance  agent  against  a  time-optimal  agent  as  a  metric  of  navigation  efficiency.  It  may  be 
appropriate  to  weight  the  learning  rewards  so  as  to  give  a  greater  reward  for  reaching  the  updraft  quickly, 
as  there  is  currently  no  mechanism  to  ensure  that  the  agent  selects  a  direct  route.  Second,  the  issue  of 
disturbance  tolerance  in  the  trajectory  generation  should  be  considered.  The  commanded  bank  angle  is 
essentially  open- loop  for  the  duration  of  the  reinforcement  learning  timestep;  it  may  be  necessary  to  either 
reduce  the  timestep  or  implement  a  system  to  detect  if  tracking  the  reference  angle  is  causing  other  aircraft 
states  to  diverge.  Third,  the  ability  of  an  RL  guidance  agent  to  fly  close  to  an  updraft  and  gain  energy 
should  be  examined.  A  future  work  is  planned  that  will  encompass  RL  guidance  for  thermal  circling  as  well 
as  longitudinal- axis  control  for  3D  flying  with  non-constant  updrafts.  Fourth  and  finally,  the  extensibility  of 
this  approach  should  be  evaluated  by  extending  the  Q-matrix  learned  on  the  SGS  1-36  to  a  new  plant. 
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A.  Dynamic  Model 


The  lateral-directional  continuous-time  dynamic  model  derived  from  Refs.  1  and  15  is: 
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The  longitudinal  axis  continuous-time  dynamic  model  from  the  same  sources  is: 
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The  steady-state  speeds  and  attitude,  estimated  from  the  flight  test  results  in  Ref.  15,  are  listed  below. 
Unlisted  values  are  zero  in  the  steady-state. 


•  Body-axis  forward  speed:  U\  =  34.4063  ™ 

•  Pitch-axis  Euler  angle  (positive  nose  up):  6\  =  —1.8873° 

•  Angle  of  attack  (positive  nose  up):  0.1766° 

•  Body- axis  vertical  speed  (positive  down):  W\  =  0.1061  ™ 
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Acceleration  control  of  highly  agile,  aerodynamically-controlled  missiles  is  a  well-known 
non-minimum  phase  control  problem.  This  problem  is  revisited  here  for  a  planar  tail- 
controlled  generic  missile,  and  a  globally  stable  nonlinear  autopilot  command  structure  is 
synthesized  to  maximize  performance.  For  the  first  time  the  non-minimum  phase  charac¬ 
teristics  of  the  vehicle  are  addressed  by  making  no  modification  to  the  output  definition  by 
inducing  an  inherent  time  scale  separation  in  the  closed-loop  dynamics.  Unlike,  previous 
time  scale  control  techniques,  results  presented  here  are  based  on  theoretical  advancements 
made  in  control  of  nonlinear  singularly  perturbed  systems.  Conditions  under  which  the 
induced  time  scale  separation  can  be  employed  for  a  stable  autopilot  design  are  also  dis¬ 
cussed.  The  state  feedback  controller  proposed  is  real-time  implementable,  independent 
of  operating  condition  and  desired  output  trajectory.  Simulation  results  presented  in  the 
paper  show  that  the  approach  is  able  to  accomplish  perfect  tracking  while  keeping  all 
closed-loop  signals  bounded. 


Nomenclature 

{a^fjB  forward  and  normal  linear  aerodynamic  accelerations  [(aG  x)b ,  (clg  z)b\ 

(Fg)b  horizontal  force  acting  about  the  c.g  of  the  missile  represented  in  the  body  frame 

(. Fz)b  vertical  force  acting  about  the  c.g  of  the  missile  represented  in  the  body  frame 

(. My)b  pitching- moment  about  the  c.g  of  the  missile  represented  in  the  body  frame 

q  dynamic  pressure 

Cm  pitch- moment  coefficient 

Cx  horizontal  force  coefficient 

Cz  vertical  force  coefficient 

G  center  of  gravity  of  the  generic  missile 

g  acceleration  due  to  gravity 

h  altitude  of  the  missile,  =  ~(zG)i 

Iyb  moment  of  inertia  of  the  missile  about  the  ys  axis 

Ki  feedback  gains,  positive  quantities 

lref  reference  length 

M  Mach  number 

m  mass  of  the  generic  missile  model 

q k*  body  pitch  rate  of  the  missile  relative  to  the  North-East-Down  frame 

Sref  reference  surface  area 

ux  forward  velocity  in  the  body  frame 

lateral  velocity  in  the  body  frame 
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Vg  total  velocity  of  the  missile 
w x  vertical  velocity  in  the  body  frame 

xo,y0,z0  triad  representing  the  North-East-Down  frame 
%B,yB,ZB  triad  representing  the  body- frame,  B 

Subscripts 

,  a  derivative  with  respect  to  angle-of-attack 

,  5  derivative  with  respect  to  pitch-control  deflection 

,  A  aerodynamic  component 

,  G  gravitational  component 

,  q  derivative  with  respect  to  pitch  rate 

0  reference  quantity 

r  reference  trajectory 

Symbols 

aSfc  angle-of-attack  of  the  missile 

/3^  side-slip  of  the  missile 

5m  pitch  control  deflection 

6  pitch-attitude  angle  of  the  missile 

0(.)  Order  symbol 


I.  Introduction 

The  main  challenge  in  designing  a  controller  for  an  aerodynamically  controlled  missile  is  to  achieve 
maximum  performance  it  is  capable  of,  without  exciting  the  unstable  internal  dynamics.  To  qualitatively 
understand  this  non-minimum  phase  behaviour  consider  the  control  problem  of  accelerating  the  missile 
(Figure  1)  upward.  Typically  a  tail-controlled  missile  (i.e  control  surface  aft  of  the  center  of  gravity,  G)  is 
statically  stable  with  Crricx  <  0,  CZ5  <  0  and  Cm5  <  0.  This  means  that  a  negative  unit-step  pitch  deflection 
command  initially  induces  a  downward  force  on  the  missile  causing  the  missile  to  accelerate  downward.  This 
downward  force  also  induces  a  counter-clockwise  pitching- moment  about  the  center  of  gravity  that  tries  to 
push  the  nose-up.  But  due  to  the  inherent  tendency  of  the  missile  to  oppose  any  such  change  in  angle- 
of-attack  the  missile  continues  to  accelerate  downward  until  an  overall  positive  pitching  moment  about  the 
center  of  gravity  develops.  Eventually  the  trim  angle-of-attack  and  consequently  the  lift  acting  on  the  vehicle 
increase  which  together  create  an  upward  force  about  the  fuselage;  and  thus  the  missile  accelerates  upward 
as  desired. 


Figure  1.  Generic  tail-controlled  missile  with  body  fixed-frame  B  and  North-East-Down  Frame  0 

The  above  described  non-minimum  phase  behaviour  is  a  characteristic  of  several  important  flight  control 
problems  such  as  control  of  Vertical  Take-off  and  Landing  (VTOL)  aircraft,  and  Conventional  Take-off  and 
Landing  (CTOL)  aircraft.  This  behaviour  is  modeled  in  the  open- loop  input-output  transfer  function  as  a 
zero  in  the  right  half  s-plane.  An  autopilot  design  based  on  cancellation  (more  commonly  known  as  feedback 
linearization)  for  such  class  of  dynamical  systems  gives  an  unstable  closed-loop  as  one  of  the  closed-loop 
poles  migrates  into  the  right  half-plane.1  Researchers  in  the  past  have  mitigated  this  undesirable  behaviour 
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by  slightly  modifying  the  output  definition,2.3  These  approaches  guarantee  ‘local’  tracking  specific  to  the 
desired  operating  condition  and  reference  command  as  a  result  of  the  modification.  Other  methods  proposed 
modify  the  sign  of  some  of  the  control  derivatives  in  the  differential  equation  description  of  the  input-output 
relationship  to  render  the  modified  output  dynamics  minimum  phase. 

The  common  control  technique  used  for  missiles  is  the  gain-scheduled  linear  controller,3,6/.  However, 
dynamics  of  a  missile  significantly  change  during  flight  and  the  nonlinearities  of  the  system  must  be  explicitly 
taken  into  account  during  design  of  the  controller  for  improved  maneuverability  and  stealth.  The  common 
nonlinear  technique  is  to  employ  approximate  input-output  linearization  5  wherein  the  coupling  between 
the  pitching  moment  and  the  aerodynamic  pitch-deflection  surface  that  causes  the  non-minimum  phase 
behaviour  is  ignored  during  autopilot  design.  This  approximation  render  the  input-output  relationship 
minimum  phase  but  is  limited  in  performance  and  domain  of  operation. 

This  paper  revisits  the  acceleration  control  problem  for  a  planar  aerodynamically-controlled  generic 
missile  and  the  main  result  is  a  globally  stable  nonlinear  autopilot  design  that  makes  no  modification  to 
the  output  definition  and  induces  a  time  scale  separation  in  the  closed-loop.  Unlike  previous  time  scale  ap¬ 
proaches 1  1  ’  ’  and  stable-inversion  results  presented  here  are  based  on  theoretical  advancements  in  control 

of  nonlinear  time  scale  systems;  successful  applications  of  which  have  been  demonstrated  on  non-minimum 
phase  nap-of-the-earth  maneuver  for  CTOL  aircraft  and  hover  control  of  an  autonomous  helicopter.  The 
state  feedback  controller  designed  here  is  real-time  implementable  and  independent  of  any  particular  oper¬ 
ating  conditions  and  desired  output  trajectory.  It  is  causal  and  does  not  require  any  knowledge  or  preview 
of  the  output  trajectory  beforehand.  This  is  an  important  contribution  as  the  proposed  nonlinear  autopilot 
can  be  integrated  with  an  on-board  guidance  module  for  real-time  operation. 

The  paper  is  organized  as  follows.  Section  II  describes  the  medium-fidelity  generic  missile  model  used  for 
simulation  that  includes  realistic  actuators  and  sensors,  and  Section  III  formulates  the  control  problem.  The 
main  ideas  of  the  paper  and  derivation  of  the  autopilot  are  presented  in  Section  IV.  The  evaluation  of  the 
developed  approach  in  simulation  is  presented  in  Section  V.  Finally,  conclusions  are  presented  in  Section  VI. 

II.  Simulation  Model  of  the  Missile 

This  section  describes  the  governing  equations  for  a  generic  missile  depicted  in  Figure  1  along  with  details 
of  the  actuator  and  sensor  subsystems. 


A.  Rigid-Body  Equations  of  Motion 


The  dynamics  of  the  three  degree-of-freedom  generic  missile  under  study 
first-order  differential  equations 

are  modeled  by  the  following  four 

-  ~  (Fx  )  B  ~  WK(lK  ! 

(la) 

=  —(Fz)b  +  uk(10ki 

(lb) 
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(lc) 

•OB 

Qk 
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written  in  the  body-frame  about  the  center  of  gravity,  G  using  Newton’s  laws  of  motion  under  the 
assumption  of  flat,  non-rotating  earth  and  balanced  lateral/directional  motion  with  Euler  angles  </>  and  ip, 
roll-rate  p,  and  yaw  rate  r  stabilized  about  the  origin.  The  reader  is  referred  to  the  nomenclature  for 
definition  of  the  various  symbols  in  (1). 

An  equivalent  representation  for  the  equations  given  in  (1)  can  be  obtained  by  noting  that  the  translational 
velocities  along  with  the  aerodynamic  angle  a and  the  absolute  velocity  V§_  satisfy  the  following 
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relationships: 


Vg  =  V(“Sf)2  +  K)2. 


a%  =  arctan  (  — S- 


Relations  given  in  (2)  can  be  further  re-arranged  as 


Uk  =  V£  cos<a^, 
sina^, 


(2a) 

(2b) 

(2c) 

(3a) 

(3b) 


to  give  expressions  for  the  translational  velocities  in  terms  of  the  angle-of-attack  and  the  total  velocity. 
Using  these  relations  in  (1)  gives  the  following  equivalent  dynamical  model  representation  of  the  generic 
missile: 
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This  representation  is  used  in  simulation  to  validate  the  performance  of  the  autopilot. 

The  forces  (Fj)b,  (^z) b  an(^  the  moment  (^y) B  (^)  consist  of  contributions  from  gravitational, 
aerodynamic,  and  propulsive  components.  Specifically,  the  aerodynamic  forces  (Fa)b  and  the  pitching 
moment  (Ma)b  with  respect  to  the  center  of  gravity  are  calculated  via  the  following  equations: 


=  qSi 
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where  lref  and  Sref  denote  the  reference  length  and  the  reference  area,  respectively  and 

Cx, 0(a£,  M)  +  Cx,Alt(a%,  M )  +  CXtq(a%,  /?g,  M)q°KB  +  CXtgM  (a£,  0%,  M)6m 
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The  aerodynamic  coefficients  defined  above  are  nonlinear  but  smooth  functions  depending  on  the  angle-of- 
attack  c^,  side  slip  angle  and  Mach  number,  M.  The  flight  envelope  covers  the  Mach  M  =  0.9, . . .  ,4.4 
and  altitude  region  h  =  0, . . . ,  11km. 


B.  Modeling  of  the  Actuator  and  Sensor  Subsystem 

The  missile  is  controlled  via  its  fin  attached  at  the  tail  of  the  missile’s  body  denoted  here  in  the  paper  as 
5m-  The  fin  is  modeled  as  a  second-order  linear  system  (described  by  damping  and  a  natural  frequency) 
with  deflection  5^irnit^  deflection  rate  5^irnit^  and  deflection  acceleration  5^irnit^  limits.  The  fin  deflection  is 
assumed  to  be  measurable. 

The  Inertial  Measurement  Unit  (IMU),  located  at  the  missile’s  c.g.,  is  modeled  as  a  first  order  element 
outputting  the  angular  rate  cpB  and  the  linear  accelerations  resulting  from  the  aerodynamic  forces  (5): 


(«a)b  =  —(Fa)b- 


(9) 
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C.  Parametric  and  Estimation  Uncertainties 


In  addition  to  the  outputs  obtained  from  the  sensors,  the  non-measurable  states  (e.g.  aerodynamic  angles, 
Mach  number,  etc.),  the  parameters  and  the  aerodynamic  data  (and  their  derivative  with  respect  to  angle- 
of-attack)  necessary  for  control  design  purposes  are  estimated.  The  uncertainties  associated  with  these 
estimates  are  assumed  to  time-independent  throughout  the  flight  envelope  and  Table  1  provides  an  overview 
of  the  considered  uncertainties. 


Table  1.  Overview  of  considered  uncertainties 


Variable 

Uncertainty  Range 

Units 

Iyb 

±5% 

[-] 

m 

±1% 

[-] 

x  eg 

±50 

[mm] 

±10% 

[-] 

Cm,0 

±20% 

[-] 

Cx,5m  5  ^z,8m  5  Cm, 5m 

±20% 

[-] 

Cm,q 

±20% 

[-] 

cyG 

aK 

±2.5 

[deg] 

M 

±10% 

h] 

Q 

±5% 

h] 

III.  Problem  Description 

The  objective  of  this  study  is  to  force  the  missile  to  track  a  desired  aerodynamic  normal  acceleration 
command  denoted  as  yr  asymptotically.  The  normal  acceleration  output  is  defined  as 

y—  ~  \Cz,o(aKi  Pki  M)  ±  fix,  M)q^  ±  Cz^m{o^<K’>  /?^,  M)Sm\  •  (10) 

The  control  problem  consists  of  generating  a  pitch  control  deflection  5m  that  produces  the  desired  normal 
acceleration  while  ensuring  complete  closed-loop  stability.  Throughout  this  study  the  total  velocity  is 
assumed  to  be  maintained  constant  through  application  of  appropriate  reaction  jets.  The  autopilot  is  turned 
on  when  the  missile  is  flying  at  the  steady- state  horizontal  flight  condition  at  h  m  5 km,  a =  Odeg  and 
5m  =  Odeg.  Additionally,  the  dynamics  of  the  pitch-attitude  angle  is  ignored  in  the  autopilot  design  as  it 
is  related  to  the  pitch  rate  q^3  through  an  exact  kinematic  relation  (see  (4)).  Consequently  the  autopilot 
design  for  acceleration  command  tracking  function  is  accomplished  through  study  of  the  following  reduced 
dynamical  model: 


a,g  _  n0B 

aK  —  Qk 


v G 


K  L 


B  {cos  aG (F§) B  -  sin  aG  (Fg) b  } 


n0B 

Qk 


(lla) 

(lib) 


with  output  defined  in  (10). 

For  convenience  in  design  and  analysis  the  following  compact  equations  for  (11)  and  the  output  (10)  will 
be  used  throughout  the  paper: 


a%=c0+  ciqp?  +  c2SM, 


(12) 
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where 


9  ( n  G  \  I  Q^ref  r, 

c0  =  T7g  cos(^  -  %)  +  lb° 


VG 

VK 


mVG 


G  •  G 1 

cos  —  ao  sm  <aKJ 


ci  =  1  +  ^-7775-  [61  cos  0//  —  cii  sina^-j 


mVg 


Q^ref  r  Q  Q 1 

c2  =  — ^  [o2  COS  -  a2  sm  a^J 


where 

_  qSreflref  A 
c3  —  - 7 - ^0 


C4  = 

C5 

and 


Iyb 
qSreflref 
Iyb 

qSreflref 

Iyb 


di 

d2 


(/If  =  C3  +  c4  qUK  +  c56m, 


OB 


(13) 


V  =  c6  +  c7q°K  +  c8SM 


(14) 


where 

Q^ref  7 

c6  =  - -b0 


C7 


C8  = 


m 

qSref 

m 

qSref  ^ 
m 


bi 


and  di,  bi ,  di  are  short-hand  definitions  for  the  aerodynamic  coefficients  defined  in  the  Table  2.  This  table 
also  gives  the  order  these  coefficients  take  over  the  range  of  angle-of-attack  and  Mach  number. 


Table  2.  Short-hand  definitions  for  aerodynamic  coefficients  used  in  control  design 


Definition 

Aerodynamic  Coefficient 

Order 

ao 

cx, o{a%,0$,  M)  +  CxM<*k,  ( zG)i ,  M) 

0(0.1) 

ai 

HX,q  5 

identically  0 

a2 

Cx,5m  (aK>  @K 5  M) 

0(0.1) 

bo 

0(10) 

bi 

Cz^qip^K-,  Pki  M) 

0(10) 

62 

Cz,sM(a%,l3Z,M) 

0(0.01) 

do 

0(10) 

di 

C'm,q(aKi  Pri  M) 

0(100) 

d2 

Cl 71, SM  (aKi  M) 

0(0.1) 

IV.  Autopilot  Design 

This  section  details  the  design  procedure  of  a  nonlinear  autopilot  for  the  output  (14)  using  the  reduced 
model  (12),  (13).  From  the  discussion  (see  Section  I)  of  the  non- minimum  phase  behaviour  of  a  generic  mis¬ 
sile  it  is  apparent  that  the  pitching  motion  plays  an  integral  part  in  ensuring  the  desired  normal  acceleration 
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command  is  followed.  This  important  role  of  the  angular  moment  is  not  accounted  in  autopilot  designs  based 
on  exact  input-output  feedback  linearization;  hence  resulting  in  undesirable  closed-loop  missile  behaviour.  A 
stable  missile  autopilot  design  must  ensure  that  the  control  deflection  being  commanded  generates  commen¬ 
surable  pitch  rate  through  change  in  aerodynamic  pitching  moment  that  will  essentially  produce  the  desired 
normal  acceleration.  This  requirement  is  posed  in  this  paper  in  the  form  of  a  sequential  two-part  control 
problem. 

Problem  1:  First,  given  the  desired  normal  acceleration  command,  determine  the  sufficient  pitch  rate 
required  to  change  the  angle-of-attack  of  the  missile  appropriately. 

Problem  2:  Second,  using  the  knowledge  of  the  computed  pitch  rate,  determine  the  control  deflection 
required  to  generate  the  appropriate  pitching  moment. 

Note  here  that  an  autopilot  designed  by  solving  the  above  problems  can  also  handle  time-varying  com¬ 
mands.  There  is  no  need  for  gain  scheduling  or  separate  design  for  different  command  set-points.  These 
computations  can  be  made  in  real-time  and  hence  for  the  same  vehicle  there  is  no  need  for  any  mission- 
dependent  tuning  of  feedback  gains.  Some  readers  may  recognize  that  the  two-part  problem  posed  above 
is  standard  in  flight  control  literature  for  aerial  vehicles.  But  it  is  important  to  point  out  that  success  of 
the  past  studies  relied  upon  the  presence  of  the  inherent  time  scale  separation  between  the  angular  moment 
and  the  translational  states.  The  main  contribution  of  this  paper  is  formulation  of  design  criteria 
that  guarantee  such  an  autopilot  design  can  be  implemented  for  vehicles  that  either  do  not 
possess  a  time  scale  separation  or  those  whose  time  scale  separation  changes  widely  over  the 
flight  operation  envelope.  In  this  paper  these  conditions  are  developed  by  identifying  criteria  that  ensure 
the  autopilot  design  formulated  by  solving  Problem  1  and  Problem  2  stabilize  the  overall  missile  dynamics. 
Essential  ground  work  on  developing  these  conditions  comes  from  study  of  non-standard  singularly  perturbed 
systems  done  by  the  first  author.  In  this  paper  the  aim  is  to  develop  and  identify  the  practical  implications 
of  these  conditions  on  the  performance  and  the  robustness  of  a  missile  autopilot.  For  more  details  on  the 
general  conditions  the  reader  is  referred  to  15,17.  In  the  following,  subsection  A  details  the  design  procedure 
of  the  autopilot  and  subsection  B  develops  and  analyzes  the  important  design  criteria. 

A.  Control  Formulation 

In  this  section  control  law  for  the  pitch  control  deflection  5m(®-k,  cPk)  is  developed  by  sequentially  addressing 
Problem  1  and  Problem  2.  Toward  this  end,  the  discussion  for  solving  each  of  these  problems  is  detailed: 

1.  Solution  to  Problem  1 

The  objective  here  is  to  determine  the  sufficient  pitch  rate  required  to  follow  a  desired  acceleration  command. 
Let  us  denote  this  sufficient  pitch  rate  command  as  q°  since  it  is  different  from  the  dynamic  state  .  Thus, 
mathematically  the  objective  is  to  find  q°(t)  such  that  the  output  defined  in  (14)  satisfies  y{t)  -A  yr{t)  as 
time  t  -A  oo.  In  this  step  we  assume  that  the  pitch  deflection  command  is  such  that  the  dynamic  pitch  rate 
q^f(t)  is  identically  equal  to  q°(t)  (this  assumption  will  be  satisfied  upon  solving  Problem  2,  and  the  effect 
of  initial  error  will  be  analyzed  in  subsection  B).  Under  this  assumption,  the  output  (14)  can  be  rearranged 
as 


y  =  c6  +  c7q°  +  cs5m{(Xki  Q°)-  (15) 

From  a  mathematical  standpoint  we  find  that  the  sufficient  pitch  rate  q°  can  be  determined  by  straightforward 
algebraic  manipulation  of  (15).  But  this  computation  is  not  useful  for  control  as  it  does  not  accommodate 
for  desired  time-domain  specifications.  Hence,  we  take  an  additional  derivative  of  (15)  to  get 

y  =  c'6  +  c7q°  +  c'8c5m(<A,  4°)  +  c8SM(aiK,q0),  (16) 

by  noting  that  the  q°  acts  as  a  equilibrium  for  the  dynamic  pitch  rate  in  this  step.  From  Table  2  we  know 
that  the  nonlinearities  q  are  time-varying  due  to  changes  in  angle-of-attack  (Mach  number  is  assumed  to 
be  constant  in  our  study,  see  first  paragraph  under  section  III,  and  side  slip  is  maintained  at  zero).  This 
means  that  cq  =  J^b^-,  C7  =  and  ds  =  Furthermore  we  find  that  c%  =  ^e/  b 2  is 
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comparatively  smaller  than  cq  and  C7  due  to  small  contribution  from  the  aerodynamic  coefficient  62-  Using 
these  approximations  (16)  becomes 


V  = 


dc6 
da ^ 


a%  + 


9c7 
da ^ 


d%q°. 


(17) 


Analyzing  the  aerodynamic  data  from  reference  19  we  find  that  is  atleast  two  times  greater  in  magnitude 
than  We  make  use  of  this  fact  to  further  reduce  (17)  to 


V  = 


dc6 
da <%; 


d%. 


(18) 


Upon  expanding  out  the  angle-of- attack  dynamics  using  (12)  and  the  definitions  eg  =  qS™f  and  cio  = 
^<22  sinaS-  above  we  get 

VK 


y  = 


dc6 
da ^ 


c0  +  ciq°  + 


fb2c9 

\vg 


cos  a 


K 


Cio 


(19) 


Equation  (19)  gives  the  explicit  input-output  relationship  between  the  output  y  and  the  input  q°.  Thus, 
ignoring  the  small  contribution  from  62  and  computing  q°  such  that 

dc 

-T^c  [co  +  Cl q°  -  cw5M(,a%,  <70)]  =  v  (20) 

we  find  that  the  closed-loop  output  dynamics  satisfies 


y  =  v.  (21) 

The  term  V  acts  as  a  virtual  control  input  in  (21).  This  input  can  be  designed  using  any  linear  control 
technique  to  assign  desired  time-domain  specifications  to  the  output.  In  this  paper,  we  formulate  v  as  a 
dynamic  controller  of  the  form 


v  =  Kvx 


(22) 


where 

x  =  -Kp(Kvx  -  yr)  -  Ki(y  -  yr), 

Ki  are  positive  gains,  and  Kp  and  Ki  act  as  proportional  and  integral  gains  respectively  that  can  be  varied 
to  achieve  desired  output  performance.  However,  the  control  designer  must  be  wary  of  the  fact  that  these 
gains  also  influence  the  nonlinear  closed-loop  stability  of  the  missile  and  must  be  selected  by  ensuring  the 
criterion  derived  later  in  subsection  B  is  also  met. 

2.  Solution  to  Problem  2 

With  the  desired  pitch  rate  computed  using  (20),  the  objective  here  is  to  develop  a  control  law  for  the 
pitch  deflection  Sm{&ki  ensure  the  dynamic  state  q^jf  stabilizes  about  q°.  This  can  be  achieved  by 

noting  that  the  pitch  rate  dynamics  is  related  to  the  control  deflection  5m  via  equation  (13).  Using  feedback 
linearization  then,  the  control  law 


=  _Kq(&-f)+c 3+c4gr  (23) 

C5 

will  ensure  the  closed-loop  dynamics  of  the  pitch  rate  becomes 

q°KB  =  -Kq(q0KB-q0),  (24) 

such  that  it  is  uniformly  asymptotically  stable  about  the  q°.  The  constant  Kq  introduced  in  (23)  above  is 
another  positive  feedback  gain  selected  by  the  designer  to  specify  desired  time-response  properties. 
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The  final  step  in  the  control  design  process  is  to  ensure  the  control  law  is  specified  in  terms  of  quantities 
that  can  be  either  estimated  or  computed  in  real-time.  For  this,  rearrange  the  quantity  (20)  using  the 
expression  5m (o^,#0)  =  —  C3+C4<?  to  get  the  following  algebraic  expression 


dce_ 

da ^ 


Co  +  Ci  q°  +  Cio 


^c3  Tc4g°^ 


=  v 


(25) 


to  compute  q°  as  a  function  of  the  angle-of-attack  in  real-time.  Hence,  (25)  together  with  (22)  and  (23) 
describe  the  autopilot  control  law. 


B.  Closed-Loop  Analysis 

The  purpose  of  this  section  is  to  develop  criterion  under  which  the  control  law  designed  in  (23)  will  asymp¬ 
totically  stabilize  the  dynamics  of  the  missile,  and  ensure  the  output  follows  the  desired  command.  As 
mentioned  earlier  in  the  paper,  this  condition  must  be  met  when  selecting  the  feedback  gains.  This  analysis 
is  carried  out  using  the  Lyapunov’s  direct  method  and  is  detailed  in  the  following  five  steps. 

The  first  step  involves  developing  the  closed-loop  dynamics.  For  this  let  us  rearrange  (12)  and  (14)  by 
expanding  c 2  and  eg  to  get 


&K  ~  co  +  Ci  + 


y  =  c6  +  c7q°x  +  ec9SM 


C9  a 

e!7 GcosaK 
.  VK 


ClO 


(26) 

(27) 


using  the  definitions  of  eg  and  cio  used  in  (20)  and  e  =  62.  As  we  will  see  later  this  rearrangement  will  assist 
us  in  studying  the  effect  of  ignoring  62  dependent  terms  during  control  design  on  the  closed-loop  stability. 
Next  substitute  the  control  law  (23)  into  (26)  and  (13),  and  replace  the  output  y  in  (22)  with  (27)  to  get 
the  following  closed- loop: 


-i  =  ^ + ■=.  W  - 1°)  +  K,cf(i°KB  -  ?°) + < 

0^  °5  VK 

•0  B  ry-  /  0  B  0\ 

QK  ~  ~Kq[QK  ~  Q  b 

x  =  -Kp(Kvx  -  yr)  -  Kt  (c6  +  c7q^  +  ec9SM  ~  yr) 


(28a) 

(28b) 

(28c) 


The  closed-loop  system  obtained  in  (28)  is  time-varying  and  has  a  trim  corresponding  to  different  output 
commands  yr.  Since  our  aim  here  is  to  assess  stability  corresponding  to  all  possible  trims,  let  us  translate 
the  trim  of  the  closed-loop  system  to  the  origin.  For  this  we  develop  the  general  equilibrium  equations  that 
are  always  satisfied  in  the  following  second  step. 

Let  us  denote  (a*,q*,x*)  as  the  trim  point.  Setting  the  time  derivative  of  the  states  <a§-,  q^f  and  x  to 
zero  in  (28)  we  get  the  following  three  equilibrium  equations: 

q*  =  q°(a*),  (29) 

+e^  cos  a* 5M(a*,q*)  =  0,  (30) 

da x  'a*  K 

-Kp(Kvx*  -  yr)  -  Kj  (cb  +  c7q *  +  ec9SM(a*,q*)  -  yr)  =  0  (31) 

where  equations  (29)  through  (31)  are  written  about  a*.  Notice  that  trim  point  for  the  dynamic  pitch  rate 
g*  is  exactly  equal  to  q°  computed  about  a*,  as  desired.  Further  since  (31)  is  satisfied  only  if  the  individual 
components  are  zero  we  get 


Kv  %  —  Vr  5 

(32) 

da<3  |Q.e(a*)^G  cos  a* 5 M (a*,  q*)  =  yr, 

(33) 

c6(a*)  +  c7(a*)q*  +  e(a*)c95M(a*  ,q*)  =  yr, 

(34) 

that  define  three  equations  in  three  unknowns. 
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The  third  step  in  the  analysis  is  to  repose  the  closed-loop  system  (28)  such  that  the  origin  becomes  the 
unique  equilibrium.  This  is  done  by  using  the  equilibrium  conditions  (32)  through  (34)  in  (28) 


aGK  = 


•0  B 


Kvx  Kvx* 


dc6 
da % 


dcQ 

da g 
ClO 


+ 

1  IV  Cl° 

Cl  +  Kq - 

L  C5  J 

eT7GcosaK&M  ~  e(a*)^y  cos  a*  5m  (a*,  q*) 

.  VK  VK 


C1  +  Kq - 

C5 


(q*~q°(a%)) 


(35a) 


Qk  =  -KMk  -  Ql  ~  Kg(q*  -  q"(aG)),  (35b) 

x  =  —KpKv(x  -  x*)  -  Ki  [{c6  -  c6(ai*)}  +  {ci<Ik  ~  c7(a*)q*}  +  {ec95M  ~  e(a*)c9SM(a*  ,q*)}]  .  (35c) 


In  the  fourth  step  select  a  Lyapunov  function  candidate  V  =  \{x  —  x *)2  +  —  a*)2  +  \{q^  —  q*)2 

to  study  the  closed- loop  system  (35).  Taking  derivative  of  this  Lyapunov  function  candidate  we  get 

V  =  (1/ cg)(x  -  x*)x  +  (a%  -  a*)a%  +  ( q 0B  -  q*)q0B .  (36) 

Evaluating  (36)  along  (35) 

v  =  -  KvKp(1/c9)(x  -  X*)2  ~  Kq(q0B  -  q*)2  -  Kq(q*  -  q°(aG))(q°KB  -  q*) 

-  (1  /cq)Ki  [{c6  -  c6(a*)}  +  {c7q°x  -  c7(a*)q*}  +  {ec95M  ~  e(a*)c95M(.cc* ,q*)}]  (x  -  x*) 


Kvx 

Kvx *  " 

is^K  —  ^*)  A 

1  iv  Cl° 
C1  +  - 

L  C5  J 

(aG  -  a*)  (q°KB 

-«*) 

dcQ 

dc6  1 

Vda% 

da ^  'a*  _ 

Cg 

evG 

L  v  K 

cos  a^SM  ~ 

-  e(a*)^  cosa*SM(a*,q*)  +  jci  +  Kq~  j 

^(q*-q°(a%))  (a%  -  a*). 

(37) 

Note  that  the  difference  terms  in  (37)  are  a  function  of  the  design  constants  and  the  aerodynamic 
coefficients.  These  terms  capture  the  deviations  occurring  in  the  dynamics  of  the  missile  due  to  initial 
condition  errors,  and  approximations  made  during  control  design.  This  means  that  the  feedback  gains  must 
be  selected  in  a  way  that  ensures  the  derivative  of  the  Lyapunov  function  candidate  is  negative  definite 
for  a  range  of  parametric  uncertainties  and  reference  commands.  One  way  of  selecting  these  gains  is  by 
performing  Monte  Carlo  simulations  over  the  range  of  uncertainties.  Instead,  in  this  paper  we  take  the 
approach  of  deriving  sufficient  condition  for  selecting  feedback  gains  that  will  ensure  stability  over  a  block 
of  uncertainties.  This  is  done  in  this  fifth  step  by  verifying  whether  the  individual  difference  terms  are 
upper-bounded  by  some  inequality: 

1.  By  the  quadratic  nature  of  the  term  K%Kp(x  —  x*)2  we  know  that 
(1  /c9)KvKp(x  -  x*)2  =  (l/c9)KvKp\(x  -  x*)\2. 

2.  Similarly 

Kq(q0KB-q*)2=Kq\(q0KB-q*)\2. 

3.  Next  consider  the  difference  terms  appearing  due  to  x  dynamics.  Using  the  definitions  of  C6,  and  C7 
from  (14)  see  that 

Ki  [{c6  -  c6(a*)}  +  {c7q°K  -  c7(a*)q*}  +  {ec96M  -  e(a*)c95M(a* ,  q*)}]  (x  -  x*) 

=  K 1  c9  [{60  -  b0(a*)}  +  {b7q°KB  -  h(a*)q*}  +  {e5M  ~  e(a*)5M(a* ,q*)}]  (x  -  x*). 

From  aerodynamic  data  we  find  that  the  coefficient  bo(a^)  =  sj,o  (<A  —  ot*)  is  approximately  linear  in 
the  angle-of-attack  with  slope  0  =  —  |^6o  I  •  Additionally  since  the  coefficient  62  is  comparatively  small 
we  assume  that  e(a^-)  ==  e(a*)  =  —  |e|.  Then  the  above  difference  term  becomes 

Kic9  [{60  -  b0(a*)}  +  {b^fjf  -  bi(a*)q*}  +  {e5M  -  e(a*)5M(a* ,q*)}]  (x  -  x*) 

<  KjCg  [{sb0(a£  -  V)}  +  {b7q°KB  -  h(a*)q*}  +e{SM~  SM(a*,  </*)}]  (*  -  x*). 
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Using  the  control  law  from  (23)  the  above  inequality  can  be  further  rearranged  to 


KiCq  [{sb0(a.K  -  a*)}  +  {biq0B  -  b1(a*)q*}  +  e{SM  ~  bM(a*,q*)}]  (x  -  x*) 


<  KiCg 


{sbo(a%  -  a*)}  +  -  h(a*)q*}  +  —  {- Kq(q\ ^  -  q*)} 

C5 


( x  —  X *) 


<  -KlCQ\sb0\\{a%  -  a*)\\(x  -  x*)\ -  KjKgCg 
+  Krcg\by/}f  -  b1(a*)q*\\(x  -  x*)\ 


c  5 


\{tk  -  q*)\\{x  -  x*)\ 


since  C5  =  —  \c$\  due  to  the  nature  of  aerodynamic  coefficient  Using  the  triangle  inequality  \biq^f  — 
b±(a*)q*\  <  \bi(a^)  —  g*}  |  +  \bi(a^)  —  bi(a*)\\q*\  and  the  fact  from  the  aerodynamic  data 

\bi(aK;)  ~  bi(a*)\  <  — /?|a^-  —  a* |  where  /3  >  0  for  all  angle-of-attack  and  Mach  number  values,  then 

Ki  [{c6  -  C6(a*)}  +  {c7q°jf  -  c7(a*)q*}  +  {ec98M  -  e(a*)c9SM{a* ,  g*)}]  (x  -  x*) 


< -KjegMKaZ  -  a*)\l(x  -  x*)\  -  KjK.cg  -  \(q0B  -  g*)||(x  -  x*)\ 

C5 

+  KjegfaiaZ)  {q°KB  -  q*}  \\(x  -  z*)|  -  KlC9l3\q*\\a%  -  a*  \\(x  -  x*)\- 

4.  The  error  due  to  arbitrary  initial  condition  of  the  integrator  state  x  is  expressed  as 


Kvx  Kvx* 


dc6 

da% 


3cq 

dc^l 


K„ 


C9$b0 


l x-x  )\\{aK-a 


by  using  the  definition  of  the  aerodynamic  coefficient  cq  and  linear  dependence  of  bo . 

5.  This  next  term  appears  due  to  coupling  between  the  translational  and  rotational  motion  components: 


1  zv  Cl° 

Cl  +  Kq - 

C5 


(a%  -  a*)  (q%*  -  q*)  <  -|ci||(ag  -  a*)\\ (< j gf  -  q*)  |  -  Kt 


cio 


C5 


I  (aK  -  a*)ll  -  Q*) 


since  c\  and  C5  are  always  negative. 


6.  The  effect  of  ignoring  the  small  quantity  e  in  the  autopilot  design  is  captured  by  the  following  difference 
term: 


e^cos a%bM  ~  e(a*)^cos a*SM(a*,q*) 


<  e 


K 
C9 
VG 

VK  L 


(c$-a*) 


K, 


- -  cos cxk(Q^F  ~  Q*)  +  ^m(cU,  q*)  {cos<a^  —  cos  a*} 


C5 


(aK  ~  “*) 


< 


C9  Kq 

T 


eV°  clCOSa« 


\qT  -q*\\{a^K-a*)\- 


e^SM(a*,q*) 

VK 


*  \  1 2 


\iaK  —  a*)l 


where  we  have  used  the  control  law  (23)  and  the  assumption  that  the  e  variation  with  angle-of-attack 
is  negligible. 

7.  Finally  the  difference  in  the  pitch  rate  trim  point  q*  and  the  input  q°  is  a  function  of  the  difference 
angle-of-attack  and  its  trim  point  from  (25).  That  is 


«*-«>£)  = 

1 


1 

’  Kvx*  ' 

1 

KyX 

ci(a*)  +  ciocjjQ.j 

/  O’  \  I  ^4(0  TV') 

ci(aK)  +  cioC5(qg) 

dc6 

Laag  J 

ci(a*)  +  cio|gj 


|  c0(o*)+Cl0(a*)C3(a*) 


+ 


c5(a*)  J  r,(n.G\  I  Maj) 

J  cnaK)  +cioC5(aG) 


{ 


co (&k)  ffi  cio (p-k) 


P\Cz(a<K) 


C5  (aK^ 
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Using  properties  4  and  5  given  above, 


1 

’  Kvx*  ' 

1 

Kvx 

l 

Kv 

ci(a*)  +  c1Q  cf{a,\ 

dc6  1 

_  dotc^  'a*  _ 

rJnG\  +  rinC4("£) 
cl\aK>  +c10C5(aG) 

dc6 

-daK  - 

Cl  +ClO^ 

CgSbO 

1 


ci(a*)+cio|g^ 

and 

1 

ci(a*)+cio|g^ 


{c0  {^k)  —  co(a*)}  — 


Cl+C10^ 


{  — |ao||<ax  —  <2*1  —  |55o||<2^-  —  (2*1  +  |^0  |  \^K 


cio  (°lk) 


c3  (4)  /  *xc3(<2*) 


c5(4) 


Cio(<2*) 


C5((2*) 


< 


Cl  +Ci0^ 


cioCc)Tnlref 


c$Iyb 


{\sx>\\a% -<*'{} 


hold  using  the  definition  of  C3  and  the  fact  that  do  =  —  \sdo\{c^x  —  a*).  Then  we  have  that 


1 


Cl  +  Cio; 


C9S50 


|(x-x*)|  + 


{  —  |  <^0  |  —  |  ^60  |  +  \bo\} 


cioCc)Tnlref  Sdo 


c$Iyb 


*: 1 


Substituting  the  above  computed  inequalities  into  (37)  we  get 
V  =  ~(l/c9)KvKp\(x  -  X*)\2  -  Kq(q°KB  -  q*)2 


KYwl+KYlq* 


K, 


C9  |«§5o  | 


lcl|  + 


KqClO 


C5 


>  Ci  +  Ci 


Cl+Cio^ 


Kv 


C9S50 


<2 


K 


a  \\x  —  x 


\Cl\-Kq\C-^\  + 

c  5 


Kq  Cq  q 

eVCc™aK 


Kn 


ci+c10^ 


{-|oo|  —  \sbo\  +  |6o|}  + 


Cl0C9TYllrefSd0 


c$Iyb 


\aK  ~  a 


KiKq\  —  \  —  Ki\b\(a%)\ 


K„ 


C5 


Cl+Cio^ 


Kn 


C9S60 


r 

eT7G5M(a*  ,q*) 

VK 

+  |  Cl  1  + 

KqClO 

1  1  ( 

eg 

{ —  |tt0 1  —  \sbo\  +  \bo\}  + 

CloCgUllrej;SdO 

\ 

C5 

f  Ci  +  do^  V 

V 

c$Iyb 

) 

which  can  be  rearranged  to 


V<[(x-x*)  (o& -a*)  (q°I?-q*)]K[(x-x*)  (a%  -  a*)  (q°j?  -  q 


*M  T 


with  matrix  IK  defined  as 


i(<4 

(38) 

(39) 

(40) 


with  Hi  defined  in  Table  3.  Then  from  Lyapunov’s  method  it  is  clear  that  the  missile  will  have  stable  closed- 
loop  dynamics  if  the  matrix  IK  is  negative-definite.  Furthermore,  if  the  matrix  IK  is  negative  definite,  the 
states  (x,  <2^,  q^*)  will  asymptotically  approach  their  trim  points  defined  in  (32)  through  (34),  which  will  in 
turn  mean  that  the  output  will  asymptotically  approach  the  desired  reference  command  yr. 

The  above  analysis  indicates  that  the  feedback  gains  must  be  chosen  to  ensure  the  matrix 
IK  is  negative-definite  for  a  range  of  parametric  uncertainties.  This  is  a  sufficiency  condition  and 
can  be  met  by  following  the  two  steps.  First,  to  ensure  negative  definiteness  for  a  range  of  uncertainties  the 
designer  must  use  the  least  upper  bound  values  for  the  aerodynamic  coefficients  and  the  flight  conditions  in 


—  KvKp(l/ Cq) 

Ml 

M2 

Ml 

Mo 

M3 

M2 

Ms 

-Kc 
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Table  3.  Definition  of  elements  of  K 


Parameter 


Definition 


Q*) 

v  w 


{ 


+  Cl  + 


KgC  IQ 

C  5 


I) 


Ml 


KM+K.m-  ^-{ici  +  l^j} 


Cl+Cio^- 


Ky 


M2 


KjKqlj-J-KjMa&l- 


Ka 


Kv 


Cl+Cio^- 


M3 


-|ci|-^|ff|  + 


-  Cg 

y§  c5 


cos<a 


Cl+Cio^ 
_ ^5_ 


(I 


v  |  (_lao|  -  |sfeo| 


\bo\}  + 


CIOCq mlref  S do 

c$Iyb 


fii.  This  will  ensure  that  gains  computed  in  the  next  step  will  guarantee  stability  for  a  block  of  parametric 
values.  At  the  end  of  the  first  step  the  matrix  IK  will  become  purely  a  function  of  the  feedback  gains,  selection 
of  which  requires  some  iteration.  In  the  second  step  the  designer  must  make  an  initial  guess  for  Kv ,  Kp  and 
Kj.  This  can  be  done  by  noticing  that  the  output  response  transfer  function  is 


2/00  =  Kv[Kps  +  Kt\ 
yr(s)  s2  +  KvKps  +  KvKt 


(41) 


at  the  end  of  Problem  1.  This  means  that  appropriate  selection  of  Kv ,  Kp  and  Kj  will  help  in  assigning 
desired  time-domain  characteristics  to  the  output.  On  the  other  hand  feedback  gain  Kq  assigns  speed  of 
response  to  the  pitch  rate  dynamics  and  must  be  chosen  to  be  of  higher  order  than  Kv ,  Kp  and  Kj.  Using 
these  values  as  the  initial  guess  into  matrix  IK  the  designer  must  iterate  a  few  more  times  to  get  desired 
negative  definiteness.  These  iterations  may  be  required  for  two  reasons:  one  for  ensuring  the  stability  of 
the  nonlinear  closed-loop  system  and  the  other  to  ensure  the  closed-loop  has  desired  properties.  This  is  an 
important  feature  about  the  construction  of  matrix  IK.  Notice  that  due  to  choice  of  the  Lyapunov  function, 
the  eigenvalues  of  matrix  IK  directly  correspond  to  the  response  of  the  states  of  the  closed-loop  system  and 
can  be  modified  as  desired. 

Finally,  before  leaving  this  section  let  us  expand  further  on  the  choice  of  the  feedback  gain  Kq.  Looking 
back  at  the  closed-loop  (35)  notice  that  if  Kq  is  chosen  to  assign  a  small  time  constant  to  the  angular  rate 
dynamics,  then  after  some  finite  time  t*  which  is  greater  than  initial  time  the  dynamics  (35)  will  reduce  to: 


q°KB(t>t*)  =  q°(aG), 
Kvx  Kvx* 


a%  = 


dc6 

da^ 


dc6  | 

c>a£  I 


+ 


cosaf  5M(a%,q°)  -  e(a *)yc  cos  a*5M(a*,q*) 


'  K 


(42a) 

(42b) 


x  =  -KpKv(x  -  x*)  -  I<i  [{c6  -c6(a*)}  +  {c7q°  -  c7(a*)q*}  +  {ec9<5M(a:f,  q°)  -  e(a*)c9SM(a*,  q*)}]  . 

(42c) 


Use  the  definition  of  the  output  y  from  (27),  and  equilibrium  relation  (34),  to  further  reduce  (42) 


q0KB(t>t*)  =  q\aGK), 

Kvx  Kvx * 

+ 

c9 

dc6  dc6  1 

_  da^  dot^  'a*  _ 

evG 

L  VK 

x  =  -Kp(Kvx  -  yr)  -  Kj  [y(a%, 

n  n\ 


C9 


v  K 


(43a) 

(43b) 

(43c) 


or 


q°If(t>t*)  =  q°(a%), 
x  =  —Kp(Kvx  -  yr)  -  K i 
dy 


dy 

KyX 

KyX *  ' 

~Vr 

da% 

dcQ 

La«g 

dc6  I 
da^  'a*  _ 

-Kj 


da x 


(44a) 

(44b) 


cos  axSM(aK,  Q°)  ~  e(a*)  — ^  cos  a*SM(a*,q*) 

.  VK  VK 
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Notice  (44b)  is  composed  of  three  terms.  The  first  two  terms  appear  due  to  arbitrary  initial  conditions, 
and  from  design  of  virtual  input  v  (22)  these  terms  will  die  out  in  time.  The  third  term  captures  the  effect 
of  ignoring  62  in  the  control  design,  and  acts  as  an  external  disturbance.  Due  to  the  integral  action  of  the 
virtual  input  v  qualitatively  we  may  say  that  the  effect  of  this  disturbance  on  the  steady  state  response  of 
the  output  will  be  zero.  However,  notice  this  conclusion  was  based  on  the  assumption  that  the  pitch  rate 
dynamics  settles  down  faster  than  the  other  states.  This  means  that  essentially  the  choice  of  feedback 
gains  must  be  made  such  that  a  time-scale  separation  is  induced  in  the  closed-loop  system. 
Furthermore,  (44)  also  suggests  that  this  induced  time  scale  separation  does  not  depend  on  whether  or  not 
the  missile  has  inherently  time  scale  separated  dynamics. 

V.  Case  Study 

The  purpose  of  this  section  is  to  demonstrate  the  controller  performance  in  simulation  for  a  generic 
missile  model  presented  in  (4),  and  illustrate  how  matrix  K  can  be  employed  in  selection  of  feedback  gains. 
The  operating  point  =  600m/sec  and  h  =  5km  is  chosen  as  initial  condition  for  the  simulation.  The 
control  command  consists  of  step  inputs  applied  to  the  acceleration  channel  with  an  amplitude  of  lOg.  This 
step  input  command  is  passed  through  a  reference  model  to  create  a  ramp-like  reference  trajectory  for  the 
autopilot  to  follow.  With  the  chosen  maneuver,  the  overall  acceleration  commanded  to  the  missile  is  close 
to  the  maximum  trimmable  acceleration  at  the  considered  envelope  point. 

A.  Feedback  gain  selection 

The  feedback  gains  are  selected  by  carrying  out  the  two  steps  listed  in  analysis  section  above: 

1.  In  order  to  ensure  stability  for  the  range  of  values  given  in  Table  1  the  least-upper  bounds  (or  greatest 
lower  bounds  for  denominators)  for  the  aerodynamic  data  terms  are  determined.  These  are  tabulated 
in  Table  4  for  the  missile  model  under  study. 

2.  At  the  end  of  the  first  step,  matrix  K  depends  purely  on  the  feedback  gains.  In  this  second  step  these 
gains  need  to  be  varied  to  ensure  the  matrix  attains  negative  definiteness  properties.  This  will  require 
iterations  and  one  may  begin  by  noting  that  Kv  acts  as  a  multiplying  factor  in  (41)  and  can  be  fixed 
to  one  initially.  Furthermore,  if  there  are  no  uncertainties  in  the  system  then  integral  action  can  be 
temporarily  turned  off  and  thus,  Kj  =  0.  Then  only  Kp  and  Kq  need  to  be  varied  to  ensure  the 
matrix  K  is  stable.  But  due  to  integral  action  of  the  state  x  we  find  that  this  matrix  is  only  negative 
semi-definite  for  the  missile.  That  is  V  <0.  Then  since  the  Lyapunov  function  candidate  is  lower 
bounded  we  have  that  the  error  states  (x  —  x*),  ( a —  a*)  and  ( q —  g*)  are  bounded.  Using  this  fact 
and  the  Barbalat’s  lemma20  one  can  show  that  V  -A  0  as  t  -A  00,  and  hence  the  errors  (x  —  x*)  -A  0, 
(< a x  —  cU)  — y  0  and  (q^3  —  g*)  -A  0.  This  proves  that  the  errors  converge  to  zero  and  the  output  tracks 
the  reference  command  asymptotically.  For  the  simulation  results  presented  next  Ky  =  8,  Kj  =  2 
and  Kq  =  10  were  selected.  With  these  values  the  eigenvalues  of  the  matrix  K  were  computed  as 
Ai  =  — 0.22,  A2  =  — 0.16  and  A3  =  0. 

B.  Simulation  results 

Figure  2  presents  the  closed-loop  results  for  the  generic  missile  being  commanded  to  acceleration  of  lOg. 
Figure  2  indicates  that  the  autopilot  demonstrated  asymptotic  tracking  irrespective  of  the  commanded 
acceleration.  The  commanded  control  deflection  is  consistent  with  the  dynamics  of  the  vehicle  described  in 
Section  I.  Between  5— lOseconds  negative  pitch-deflection  commands  are  generated  to  induce  a  positive  change 
in  angle-of-attack  and  accelerate  the  missile  upward  as  desired.  This  negative  control  deflection  command 
is  corrected  at  lOseconds  when  the  commanded  acceleration  is  brought  back  to  Og.  Similar  behaviour  is 
seen  between  15  —  20  seconds  with  positive  control  deflection  command  inducing  a  negative  angle-of-attack 
decelerating  the  missile  to  —10 g.  At  20seconds  the  pitch-deflection  command  starts  to  go  back  to  0 deg  and 
around  22.5  seconds  the  missile  stabilizes  about  origin. 

The  inherent  tendency  of  the  missile  to  resist  change  is  evident  each  time  the  acceleration  command  is 
altered.  Initially  between  0  —  2.5  seconds  the  transient  changes  in  the  normal  acceleration  are  corrected 
through  small  control  deflections.  Furthermore  at  5seconds,  even  though  a  negative  deflection  is  generated, 
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Table  4.  Values  used  in  computation  of  m  (given  in  appropriate  units) 


Parameter 

Value 

14 

0.0027 

M 

49.9043 

P 

9.5077 

bbol 

33.07 

|  ^d0  | 

74.28 

N 

43.75 

kol 

0.2512 

C5 

0.176 

|ci| 

0.2428 

Ciol 

0.0696 

\q*  1 

7 r 

\SM(a*,q*)\ 

45(180/tt) 

the  missile  starts  to  accelerate  in  the  wrong  direction  until  sufficient  change  in  angle-of- attack  is  created. 
Similar  behaviour  is  seen  at  10,  15, and  20  seconds  respectively.  Despite  this  non-minimum  phase  behaviour 
the  autopilot  maintained  asymptotic  tracking  throughout  the  simulation. 

VI.  Conclusions 

In  this  paper  synthesis  of  a  nonlinear  autopilot  was  presented  for  a  generic  aerodynamically  controlled 
missile.  The  non- minimum  phase  characteristics  of  the  missile  were  addressed  by  inducing  a  time  scale 
separation  in  the  closed- loop  to  ensure  the  rotational  dynamics  remain  stable  at  all  times.  The  designed 
autopilot  exhibits  almost  linear  time-invariant  closed-loop  dynamics  as  a  result  of  which  the  designed  missile 
control  system  successfully  fulfilled  the  demanding  maneuver.  Based  on  the  results  presented  in  the  paper, 
the  following  three  conclusions  are  drawn.  First,  the  final  output  tracking  rise-time  of  the  step  response  is 
1.5seconds  and  the  settling  time  is  around  4seconds.  This  close  output  tracking  was  a  result  of  computing 
the  desired  pitch  rate  in  real-time.  Second,  the  autopilot  is  independent  of  the  commanded  reference  and 
globally  applicable.  Third,  unlike  other  exact  output  tracking  approaches  for  non-minimum  phase  systems, 
the  autopilot  is  causal  and  does  not  require  any  prior  information  or  preview  of  the  desired  reference. 
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In  this  project,  we  have  developed  sampling  based  feedback  planning  techniques  for  the 
solution  of  Markov  Decision  Problems  (MDP)  and  Partially  Observed  MDPs  (POMDP). 
The  primary  application  of  interest  has  been  the  feedback  motion  planning  for  unmanned 
robotic  vehicles,  both  ground  and  air  based.  The  techniques  have  been  implemented  on 
real  hardware  and  shown  to  be  robust  and  real  time  implementable.  We  have  also 
extended  these  techniques  to  the  case  of  multi-vehicle  systems.  The  following  papers 
have  resulted  form  this  work.  Attached  to  this  document  are  also  four  papers  that  best 
summarize  the  contributions  in  this  work. 
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The  papers  attached  are  1,  2,  3  and  6  which  encapsulate  the  key  contributions  of  this 
work.  A  brief  description  of  these  papers  is  given  below. 

Paper  1.  In  this  paper  we  present  feedback-based  information  roadmap  (FIRM),  a 
multiquery  approach  for  planning  under  uncertainty  which  is  a  belief-space  variant  of 
probabilistic  roadmap  methods.  The  crucial  feature  of  FIRM  is  that  the  costs  associated 
with  the  edges  are  independent  of  each  other,  and  in  this  sense  it  is  the  first  method  that 
generates  a  graph  in  belief  space  that  preserves  the  optimal  substructure  property.  From  a 
practical  point  of  view,  FIRM  is  a  robust  and  reliable  planning  framework.  It  is  robust 
since  the  solution  is  a  feedback  and  there  is  no  need  for  expensive  replanning.  It  is 
reliable  because  accurate  collision  probabilities  can  be  computed  along  the  edges.  In 
addition,  FIRM  is  a  scalable  framework,  where  the  complexity  of  planning  with  FIRM  is 
a  constant  multiplier  of  the  complexity  of  planning  with  PRM.  In  this  paper,  FIRM  is 
introduced  as  an  abstract  framework.  As  a  concrete  instantiation  of  FIRM,  we  adopt 
stationary  linear  quadratic  Gaussian  (SLQG)  controllers  as  belief  stabilizers  and 
introduce  the  so-called  SLQG-FIRM.  In  SLQG-FIRM  we  focus  on  kinematic  systems 
and  then  extend  to  dynamical  systems  by  sampling  in  the  equilibrium  space.  We 
investigate  the  performance  of  SLQG-FIRM  in  different  scenarios. 

Paper  2.  Motion  planning  in  belief  space  (under  motion  and  sensing  uncertainty)  is  a 
challenging  problem  due  to  the  computational  intractability  of  its  exact  solution.  The 
Feedback-based  Information  RoadMap  (FIRM)  framework  made  an  important  theoretical 
step  toward  enabling  roadmap-based  planning  in  belief  space  and  provided  a 
computationally  tractable  version  of  belief  space  planning.  However,  there  are  still 
challenges  in  applying  belief  space  planners  to  physical  systems,  such  as  the  discrepancy 
between  computational  models  and  real  physical  models.  In  this  paper,  we  propose  a 
dynamic  replanning  scheme  in  belief  space  to  address  such  challenges.  Moreover,  we 
present  techniques  to  cope  with  changes  in  the  environment  (e.g.,  changes  in  the  obstacle 
map),  as  well  as  unforeseen  large  deviations  in  the  robot’s  location  (e.g.,  the  kidnapped 
robot  problem).  We  then  utilize  these  techniques  to  implement  the  first  online  replanning 
scheme  in  belief  space  on  a  physical  mobile  robot  that  is  robust  to  changes  in  the 
environment  and  large  disturbances.  This  method  demonstrates  that  belief  space  planning 
is  a  practical  tool  for  robot  motion  planning. 

Paper  3.  This  paper  presents  a  strategy  for  stochastic  control  of  small  aerial  vehicles 
under  uncertainty  using  graph-based  methods.  In  planning  with  graph-based  methods, 
such  as  the  Probabilistic  Roadmap  Method  (PRM)  in  state  space  or  the  Information 
RoadMaps  (IRM)  in  information-state  (belief)  space,  the  local  planners  (along  the  edges) 
are  responsible  to  drive  the  state/belief  to  the  final  node  of  the  edge.  However,  for  aerial 
vehicles  with  minimum  velocity  constraints,  driving  the  system  belief  to  a  sampled  belief 
is  a  challenge.  In  this  paper,  we  propose  a  novel  method  based  on  periodic  controllers,  in 
which  instead  of  stabilizing  the  belief  to  a  predefined  probability  distribution,  the  belief  is 
stabilized  to  an  orbit  (periodic  path)  of  probability  distributions.  Choosing  nodes  along 
these  orbits,  the  node  reachability  in  belief  space  is  achieved  and  we  can  form  a  graph  in 
belief  space  that  can  handle  higher-order-dynamics  or  non-stoppable  systems  (whose 
velocity  cannot  be  zero),  such  as  fixed  wing  aircraft.  The  proposed  method  takes 
obstacles  into  account  and  provides  a  query-independent  graph,  since  its  edge  costs  are 
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independent  of  each  other.  Thus,  it  satisfies  the  principle  of  optimality.  Therefore, 
dynamic  programming  can  be  utilized  to  compute  the  best  feedback  on  the  graph.  We 
demonstrate  the  method’s  performance  on  a  unicycle  robot  and  a  six  degrees  of  freedom 
small  aerial  vehicle. 

Paper  6.  In  this  paper,  the  generalized  motion  planning  algorithm  (Generalized  PRM  : 
GRPM  [1,  2,  3])  is  extended  to  a  class  of  multi-agent  motion  planning  problem  in 
presence  of  process  uncertainty  and  stochastic  maps.  The  proposed  algorithm  is  a 
hierarchical  approach  towards  constructing  a  passive  coordination  strategy  which  utilizes 
an  existing  multiple  traveling  salesman  problem  (MTSP)  solution  methodology  in 
conjunction  with  the  GPRM  framework  to  solve  the  multi-agent  motion  planning 
problem.  The  proposed  algorithm  is  generalized  to  tackle  multi-agent  problems  involving 
heterogeneous  agents.  The  algorithm  is  used  to  solve  multiagent  motion  planning 
problems  involving  2-dimensional  and  3-dimensional  agents  in  stochastic  maps  with 
uncertainty  in  the  motion  model.  Results  indicate  that  the  algorithm  successfully  solves 
the  problem  under  uncertainty,  and  generates  a  solution  having  high  probability  of 
success.  It  also  demonstrates  that  the  algorithm  is  scalable  in  terms  of  number  of  start  and 
goal  locations,  the  number  of  agents  and  their  dynamics. 
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Abstract 

In  this  paper  we  present  feedback-based  information  roadmap  ( FIRM),  a  multi-query  approach  for  planning  under 
uncertainty  which  is  a  belief-space  variant  of  probabilistic  roadmap  methods.  The  crucial  feature  of  FIRM  is  that  the 
costs  associated  with  the  edges  are  independent  of  each  other,  and  in  this  sense  it  is  the  first  method  that  generates 
a  graph  in  belief  space  that  preserves  the  optimal  substructure  property.  From  a  practical  point  of  view,  FIRM  is  a 
robust  and  reliable  planning  framework.  It  is  robust  since  the  solution  is  a  feedback  and  there  is  no  need  for  expensive 
replanning.  It  is  reliable  because  accurate  collision  probabilities  can  be  computed  along  the  edges.  In  addition,  FIRM  is 
a  scalable  framework,  where  the  complexity  of  planning  with  FIRM  is  a  constant  multiplier  of  the  complexity  of  planning 
with  PRM.  In  this  paper,  FIRM  is  introduced  as  an  abstract  framework.  As  a  concrete  instantiation  of  FIRM,  we  adopt 
stationary  linear  quadratic  Gaussian  (SLQG)  controllers  as  belief  stabilizers  and  introduce  the  so-called  SLQG-FIRM.  In 
SLQG-FIRM  we  focus  on  kinematic  systems  and  then  extend  to  dynamical  systems  by  sampling  in  the  equilibrium  space. 
We  investigate  the  performance  of  SLQG-FIRM  in  different  scenarios. 


Keywords 

Planning,  control,  uncertainty,  information,  belief  space 


1.  Introduction 

Decision-making  under  uncertainty  is  a  crucial  ability  for 
most  robotic  systems.  In  the  presence  of  uncertainty  in  a 
robot’s  motion  and  uncertainty  in  its  sensory  readings,  the 
true  robot  state  is  not  available  for  decision-making  pur¬ 
poses.  In  such  cases,  a  state  estimation  module  can  provide 
a  probability  distribution  over  all  possible  states,  referred  to 
as  information  state  or  belief.  Therefore,  decision-making 
under  motion  and  sensing  uncertainties  needs  to  be  per¬ 
formed  in  the  information  space  (belief  space).  In  its  most 
general  form,  this  decision-making  can  be  formulated  as 
a  partially  observable  Markov  decision  process  (POMDP) 
problem  (Astrom,  1965;  Smallwood  and  Sondik,  1973; 
Kaelbling  et  al.,  1998).  However,  only  a  very  small  class  of 
problems  formulated  using  POMDP  can  be  solved  exactly 
due  to  its  computational  complexity  (Papadimitriou  and 
Tsitsiklis,  1987;  Madani  et  al.,  1999).  In  particular,  plan¬ 
ning  (i.e.  solving  POMDPs)  over  continuous  state,  control, 
and  observation  spaces  is  a  big  challenge. 

On  the  other  hand,  in  the  absence  of  uncertainty, 
sampling-based  path-planning  algorithms  including  graph- 
based  methods  such  as  probabilistic  roadmap  methods 
(PRMs)  (Kavraki  et  al.,  1996)  and  their  variants  (see 
e.g.  Amato  et  al.,  1998),  and  tree-based  methods  such  as 


rapidly-exploring  randomized  trees  (RRTs)  (Lavalle  and 
Kuffner,  2001),  expansive  space  trees  (Hsu,  2000)  and  their 
variants  (e.g.  Karaman  and  Frazzoli,  2011)  have  shown 
great  success  in  solving  robot  motion  planning  problems. 
Nevertheless,  direct  transformation  of  the  roadmap-based 
methods  to  planning  under  uncertainty  (in  belief  space)  is  a 
challenge  for  two  main  reasons.  The  first  issue  is  ensuring 
that  the  roadmap  nodes  are  reachable.  The  second  challenge 
is  that  the  incurred  costs  on  different  edges  of  the  roadmap 
depend  on  each  other,  which  violates  a  basic  assumption  in 
roadmap-based  methods  that  each  roadmap  edge  represent 
an  independent  planning  problem. 

In  this  paper,  we  generalize  the  PRM  framework  to 
obtain  the  feedback-based  information  roadmap  (FIRM) 
framework  that  takes  into  account  both  motion  and  sensing 
uncertainties.  FIRM  is  constructed  as  a  roadmap  (graph) 
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Fig.  1.  (a)  A  simple  PRM  in  state  space,  (b)  Assuming  Gaussian  belief  space,  belief  snapshots  along  different  paths  starting  from  v°  and 
ending  at  v1 1  are  shown.  As  can  be  seen,  the  obtained  belief  depends  on  the  path  traveled  by  the  robot.  For  example,  P1 1  ( 0, 1, 3, 6, 9, 10) 
denotes  the  estimation  covariance  at  node  v11,  when  the  robot  has  traversed  a  path  through  nodes  ( 0, 1, 3, 6, 9, 10)  prior  to  node  11.  (c) 
Corresponding  belief  paths  in  the  belief  space.  Belief  at  each  node  depends  on  the  initial  belief,  actions  taken  (edges),  and  obtained 
observations  (random).  Therefore,  the  generated  structure  in  the  belief  space  is  not  a  graph  but  a  random  tree,  (d)  Unique  beliefs 
assigned  to  each  PRM  node.  Using  stabilizers,  regardless  of  the  action  and  observation  history,  the  belief  at  each  node  stops  at  these 
predefined  beliefs,  (e)  The  FIRM  corresponding  to  the  given  PRM;  blc  denotes  graph  nodes  in  the  belief  space  and  [jlij  denotes  local 
planners  (graph  edges). 


in  the  belief  space,  where  graph  nodes  are  beliefs  (rig¬ 
orously  speaking,  small  subsets  of  the  belief  space)  and 
edges  are  local  controllers  in  belief  space.  FIRM  is  an 
abstract  generic  framework  that  relies  on  the  existence  of 
an  appropriate  belief  node  sampler  and  connector  (local 
controller).  We  also  construct  a  stationary  linear  quadratic 
Gaussian  controller-based  (SLQG-based)  instantiation  of 
this  generic  framework,  called  SLQG-FIRM,  where  we  pro¬ 
vide  a  specific  node  sampler  and  connector.  In  SLQG- 
FIRM  we  first  focus  on  the  kinematic  systems  and  then 
extend  it  to  dynamical  systems  by  restricting  sampling 
space  to  the  equilibrium  space.  SLQG-FIRM  is  the  first 
method  that  generalizes  the  PRM  to  the  belief  space  such 
that  the  incurred  costs  on  different  edges  of  the  roadmap 
are  independent  of  each  other,  while  providing  a  straightfor¬ 
ward  approach  to  sample  reachable  belief  nodes.  This  prop¬ 
erty  is  a  direct  consequence  of  utilizing  feedback  controllers 
in  the  construction  of  FIRM.  Based  on  this  property,  the 
FIRM  framework  breaks  the  curse  of  history  in  POMDPs 


(Pineau  et  al.,  2003),  and  provides  the  optimal  feedback 
policy  over  the  roadmap  instead  of  returning  a  single 
nominal  path. 

Figure  1  illustrates  the  problem  of  edge  dependence  in 
the  direct  transformation  of  PRM  to  stochastic  domains.  It 
also  shows  the  approach  of  FIRM  in  generating  a  graph  in 
belief  space  with  independent  edges.  Figure  1(a)  depicts 
a  simple  PRM  in  the  state  space  with  twelve  nodes  V  = 
{v°, . . . ,  v11}.  Figure  1(b)  shows  the  belief  evolution  on  the 
underlying  PRM.  Assuming  the  belief  is  Gaussian  in  this 
example,  we  represent  a  point  in  belief  space  using  a  mean 
Jc+  and  a  covariance  P,  in  other  words,  a  belief  b  is  charac¬ 
terized  by  the  pair  b  =  (x+,P).  In  Figure  1(b),  mean  values 
are  shown  by  small  filled  circles,  and  covariance  matrices 
are  shown  by  their  corresponding  3a  ellipses  centered  at 
the  mean.  We  drive  the  system  from  v°  toward  the  node  v11 . 
The  initial  belief  at  node  v°  is  b°  =  (xjj~,P°).  The  belief 
propagation  from  left  to  right  starting  from  bo  is  shown  in 
Figure  1(b). 
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Although  there  exists  a  single  edge  e(10,11)  between  nodes 
v10  and  v11  in  PRM  (see  Figure  1(a)),  the  belief  evolu¬ 
tion  along  e(10’H)  is  not  unique  (see  Figures  l(b)-(c))  since 
it  depends  on  (i)  the  initial  belief,  (ii)  obtained  observa¬ 
tions  (observation  history),  and  (iii)  the  path  taken  (action 
history)  that  has  led  to  v10.  Figure  1(c)  shows  the  cor¬ 
responding  belief  propagation  in  the  belief  space,  where 
each  rectangle  encodes  a  mean  and  covariance.  As  seen  in 
Figure  1(c),  the  belief  paths  do  not  form  a  graph;  rather, 
they  form  a  random  tree  in  belief  space.  Hence,  in  practice, 
where  observations  are  random,  not  only  does  the  number 
of  possible  beliefs  grow  exponentially,  but  the  belief  also 
evolves  randomly. 

Therefore,  to  predict  edge  costs,  full  knowledge  of  the 
belief  at  the  start  of  the  edge  is  required.  This  in  turn 
requires  full  knowledge  of  the  history  of  observations  and 
actions  leading  up  to  the  start  of  the  edge. 

Even  if  future  observations  were  assumed  to  be  deter¬ 
ministic  for  the  purpose  of  planning,  the  generated  structure 
would  still  be  a  tree  that  grows  exponentially  in  the  size  of 
the  underlying  PRM  graph. 

In  FIRM,  we  use  local  feedback  planners  to  drive  the 
belief  process  toward  the  predefined  unique  beliefs  asso¬ 
ciated  with  PRM  nodes  (see  Figure  1(d)).  As  a  result,  the 
evolution  of  belief  after  a  FIRM  node  is  reached  is  indepen¬ 
dent  of  the  evolution  of  belief  before  that  node  is  reached. 
This  breaks  the  curse  of  history,  allowing  us  to  construct 
a  PRM-like  roadmap  in  the  belief  space  with  independent 
edge  costs.  Therefore,  in  contrast  to  the  main  body  of  the  lit¬ 
erature  in  motion  planning  under  uncertainty,  FIRM  can  be 
re-used  for  future  queries  and  does  not  need  to  reconstruct 
the  roadmap  every  time  a  new  query  is  submitted. 

From  an  algorithmic  perspective,  this  edge  independence 
is  an  example  of  the  optimal  substructure  property.  A  prob¬ 
lem  has  an  optimal  substructure  only  if  the  optimal  solu¬ 
tion  can  be  obtained  from  a  combination  of  optimal  solu¬ 
tions  to  its  subproblems  (Cormen  et  al.,  2001).  To  solve 
a  problem  using  dynamic  programming  (DP)  or  its  suc¬ 
cessive  approximation  schemes,  such  as  Dijkstra’s  algo¬ 
rithm,  the  optimal  substructure  assumption  has  to  hold 
(Sniedovich,  2006),  that  is,  the  cost  of  any  subpath  has 
to  be  independent  of  what  precedes  it  and  what  succeeds 
it.  As  mentioned,  the  direct  transformation  of  sampling- 
based  methods  to  belief  space  breaks  this  assumption,  while 
FIRM  preserves  it.  Furthermore,  edge  independence  allows 
the  challenging  task  of  computing  collision  probabilities 
to  be  done  offline,  for  each  edge  separately,  without  per¬ 
forming  costly  computations  repeatedly  and  without  any 
simplifying  assumption. 

The  current  paper  draws  on  earlier  work  published  in 
conference  papers  (Agha-mohammadi  et  al.,  2011,  2012b, 
2013b).  Compared  with  Agha-mohammadi  et  al.  (2011), 
in  this  paper,  we  construct  the  FIRM  framework  more 
rigorously  by  detailing  the  procedure  of  transforming 
the  POMDP  problem  to  the  belief  semi-Markov  decision 
process  (SMDP)  problem,  and  then  to  the  FIRM  Markov 


decision  process  (FIRM  MDP)  problem,  where  the  policy 
on  the  graph  and  overall  hybrid  policy  generated  by  FIRM 
are  distinguished  clearly.  Also,  in  this  paper  we  provide  a 
clearer  distinction  between  the  abstract  FIRM  framework 
and  its  instantiations,  and  we  provide  more  rigorous  expla¬ 
nation  and  proofs  on  SLQG-FIRM.  Further,  we  append 
the  proofs  of  the  probabilistic  completeness  of  FIRM  to 
this  paper,  which  completes  the  work  in  Agha-mohammadi 
et  al.  (2012b).  We  also  present  new  unpublished  results  on 
the  performance  of  SLQG-FIRM  in  more  difficult  environ¬ 
ments,  and  demonstrate  its  real-time  planning  capabilities. 
Further,  we  provide  a  complexity  analysis  of  the  method 
and  compare  it  to  state-of-the-art  methods. 

The  outline  is  as  follows.  In  the  next  section,  we  review 
the  most  relevant  related  work.  Section  3  provides  an 
overview  of  the  method  and  its  contributions.  In  Section 
4  we  describe  the  general  problem  of  feedback  motion¬ 
planning  under  uncertainty,  present  notation,  and  formu¬ 
late  the  POMDP  problem.  In  Section  5,  we  present  the 
SLQG-based  instantiation  of  the  abstract  FIRM  framework 
by  providing  concrete  belief  samplers  and  connectors  (local 
planners).  In  Section  6,  assuming  the  existence  of  belief 
samplers  and  connectors,  we  introduce  the  abstract  FIRM 
framework  and  detail  the  process  of  transforming  POMDP 
to  a  FIRM  MDP.  In  Section  7,  aiming  at  evaluating  the  qual¬ 
ity  of  the  FIRM  solution,  we  extend  the  concepts  of  success 
and  probabilistic  completeness  to  the  stochastic  setting  and 
prove  the  probabilistic  completeness  of  the  FIRM  frame¬ 
work.  Experimental  results  are  presented  in  Section  8.  In 
Section  9,  we  discuss  limitations  of  the  framework,  future 
work,  and  open  issues.  Section  10  concludes  the  paper. 


2.  Related  work 

In  this  section  we  review  the  related  work  and  place  our 
work  into  context.  First,  we  review  the  related  work  on  plan¬ 
ning  algorithms  under  uncertainty,  and  then  we  consider  the 
work  concerning  probabilistic  completeness. 

2.7.  Planning  algorithms 

Uncertainty  in  robotic  systems  usually  stems  from  three 
sources:  (i)  motion  uncertainty,  which  results  from  the 
noise  that  affects  system  dynamics;  (ii)  sensing  uncer¬ 
tainty,  caused  by  noisy  sensory  measurements,  which  is  also 
referred  to  as  imperfect  state  information;  and  (iii)  uncer¬ 
tainty  in  the  environment  map,  such  as  uncertain  obstacle 
locations  or  uncertain  locations  of  features  (information 
sources)  in  the  environment. 

Methods  such  as  those  in  Missiuro  and  Roy  (2006), 
Guibas  et  al.  (2008),  and  Nakhaei  and  Lamiraux  (2008) 
deal  with  map  uncertainty.  However,  we  do  not  scruti¬ 
nize  these  methods,  since  we  assume  there  is  no  uncer¬ 
tainty  in  the  environment  map.  Methods  such  as  those  in 
Alterovitz  et  al.  (2007),  Melchior  and  Simmons  (2007), 
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and  Chakravorty  and  Kumar  (2009, 2011)  exploit  sampling- 
based  motion-planning  ideas  to  deal  with  motion  uncer¬ 
tainty.  However,  methods  that  are  most  related  to  FIRM 
consider  both  motion  and  sensing  uncertainties  in  planning, 
where  the  ultimate  goal  is  to  solve  a  POM  DP  problem, 
in  other  words,  to  find  the  best  policy  that  generates  opti¬ 
mal  actions  as  a  function  of  belief.  However,  due  to  the 
intractability  of  the  POM  DP  solution,  the  practical  results 
using  these  methods  are  usually  limited  to  problems  with 
small  sets  of  discrete  states  (Kaelbling  et  al.,  1998).  Point- 
based  POM  DP  solvers  such  as  those  in  Porta  et  al.  (2006), 
Kurniawati  et  al.  (2008),  Bai  et  al.  (2010),  and  Ong  et  al. 
(2010)  have  increased  the  size  of  problems  that  can  be 
solved  by  POM  DPs.  However,  they  do  not  handle  continu¬ 
ous  state,  control,  and  observation  spaces.  For  the  Gaussian 
belief  case,  Van  den  Berg  et  al.  (2011,  2012)  handle  con¬ 
tinuous  spaces  locally  around  a  given  trajectory  in  belief 
space.  Platt  et  al.  (2011)  generalize  the  local  approaches  to 
non-Gaussian  beliefs. 

In  continuous  state,  control,  and  observation  space,  the 
main  body  of  methods  does  not  follow  the  POM  DP  frame¬ 
work  due  to  its  extreme  complexity.  Instead,  these  methods 
return  a  nominal  path  as  the  solution  of  the  planning  prob¬ 
lem,  which  is  fixed  regardless  of  the  process  and  sensor 
noise  in  the  execution  phase.  Censi  et  al.  (2008)  propose 
a  planning  algorithm  based  on  graph  search  and  constraint 
propagation  on  a  grid-based  representation  of  the  space. 
Platt  et  al.  (2010)  plan  in  continuous  space  by  finding  the 
best  nominal  path  using  nonlinear  optimization  methods. 
In  the  linear  quadratic  Gaussian  motion  planning  (LQG- 
M  P)  method  (Van  den  Berg  et  al.,  2010),  among  the  finite 
number  of  RRT  paths,  the  best  path  is  found  by  simulat¬ 
ing  the  performance  of  LQG  on  all  RRT  paths.  Bry  and 
Roy  (2011)  propose  a  tree-based  approach,  in  which  the 
underlying  nominal  trajectory  is  optimized  using  RRT*. 
Vitus  and  Tomlin  (2011)  also  propose  an  approach  to  opti¬ 
mize  the  underlying  trajectory  by  formulating  the  problem 
as  a  chance-constrained  optimal  control  problem.  In  Van 
den  Berg  et  al.  (2011),  the  authors  also  extend  the  LQG- 
M  P  to  roadmaps.  Prentice  and  Roy  (2009)  and  Huynh  and 
Roy  (2009)  also  utilize  roadmap-based  methods  based  on 
the  PRM  approach,  where  the  best  path  is  found  through 
a  breadth-first  search  on  the  belief  roadmap  (BRM ).  How¬ 
ever,  in  all  these  roadmap-based  methods,  the  optimal  sub¬ 
structure  assumption  is  violated,  in  other  words,  the  costs 
of  different  edges  on  the  graph  depend  on  each  other.  The 
point-based  POM  DP  planner  in  Kurniawati  et  al.  (2012) 
takes  into  account  motion,  observation,  and  map  uncer¬ 
tainties,  and  advances  the  previous  point-based  methods 
by  introducing  guided  cluster  sampling.  It  starts  with  a 
roadmap  in  the  configuration  space,  and  grows  a  single¬ 
query  tree  in  the  belief  space,  rooted  in  the  initial  belief. 
Since  these  methods  return  a  nominal  path  instead  of  a 
feedback  policy,  the  path  needs  to  be  recomputed  (in  other 
words,  replanning  has  to  be  performed)  in  the  case  of  large 


deviations  or  when  starting  from  a  new  point.  However, 
unless  the  planning  domain  is  small  (e.g.  in  Platt  et  al., 
2010),  replanning  using  these  methods  is  computationally 
very  expensive.  The  reason  for  this  is  that  the  constructed 
planning  tree  depends  on  the  starting  belief,  and  therefore 
all  computations  needed  to  construct  the  tree  (including 
predicting  future  costs)  have  to  be  reproduced  from  the 
new  starting  belief.  BRM  ameliorates  this  expensive  com¬ 
putation  using  covariance  factorization  techniques,  but  it 
still  does  not  satisfy  the  optimal  substructure  assumption. 
Thus,  for  a  new  query  from  a  new  initial  point,  BRM  needs 
to  perform  the  search  algorithm  again.  In  the  presence  of 
obstacles,  recomputing  the  collision  probabilities  is  also 
needed,  which  makes  replanning  even  more  expensive.  In 
other  words,  these  methods  are  single-query,  in  the  sense 
that  the  edge  costs  are  computed  for  a  given  query. 

Since  these  methods  are  single-query,  online  replanning 
can  be  done  only  if  the  planning  domain  is  small  (e.g.  in 
Platt  et  al.,  2010)  or  if  the  planning  horizon  is  short,  such  as 
receding-horizon-control-based  (RHC-based)  approaches 
(e.g.  Chakravorty  and  Erwin,  2011).  The  method  proposed 
in Toit and  Burdick  (2010)  isan  RHC-based  method,  where 
the  nominal  path  is  updated  dynamically  over  an  N -step 
horizon.  The  PUMA  framework  proposed  in  He  et  al. 
(2011)  is  also  an  RHC-based  framework,  where  instead 
of  a  single  action,  a  sequence  of  actions  (macro-action)  is 
selected  at  every  decision  stage.  However,  these  methods 
entail  repeatedly  solving  open  loop  optimal  control  prob¬ 
lems  at  every  time  step,  which  is  computationally  very 
expensive  as  the  previous  computations  cannot  be  reused 
for  the  queries  from  the  new  initial  point.  In  FIRM ,  how¬ 
ever,  a  feedback  policy  (that  is,  a  mapping  from  belief  space 
to  actions)  is  computed  offline.  Thus  in  replanning  from  a 
new  initial  point,  the  computations  need  not  be  reproduced. 
Thus  for  a  fixed  goal,  the  algorithm  is  robust  to  changes 
in  the  start  point  of  the  query.  It  is  also  robust  to  changes 
in  the  goal  point,  because  graph  feedback  can  be  computed 
(see  Equation  (32))  online,  which  results  in  a  multi-query 
roadmap  in  the  belief  space. 

In  the  methods  that  account  for  sensing  uncertainty, 
the  state  has  to  be  estimated  based  on  measurements. 
To  handle  unknown  future  measurements  in  the  planning 
stage,  the  methods  in  Censi  et  al.  (2008),  Huynh  and  Roy 
(2009),  Prentice  and  Roy  (2009),  Platt  et  al.  (2010),  and 
Toit  and  Burdick  (2010)  consider  the  maximum  likelihood 
(M  L)  observation  sequenceto  predict  the  estimation  perfor¬ 
mance.  In  contrast,  FIRM  takes  all  possible  future  observa¬ 
tions  into  account  in  the  planning  stage.  The  methods  in 
Van  den  Berg  et  al.  (2010,  2011)  also  consider  all  possible 
future  observations. 

In  the  presence  of  obstacles,  due  to  the  dependency  of 
collision  events  on  each  other  in  different  time  steps,  it  is 
a  burdensome  task  to  include  the  collision  probabilities  in 
planning.  Thus,  the  methods  in  Censi  et  al.  (2008),  Van 
den  Berg  et  al.  (2010,  2011),  and  Toit  and  Burdick  (2010) 
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design  some  safety  measures  to  account  for  obstacles  in 
planning.  A  problem  with  some  of  these  collision  prob¬ 
ability  measures  is  that  they  are  built  on  the  assumption 
that  the  collision  probabilities  at  different  stages  along  the 
path  are  independent  of  each  other,  which  is  not  true  in 
general  and  may  lead  to  very  conservative  plans  (see  Fig¬ 
ure  2).  As  a  result,  different  methods  (e.g.  Patil  etal.,  2012) 
aim  at  providing  more  accurate  and  faster  ways  of  com¬ 
puting  collision  probabilities.  In  FIRM,  however,  collision 
probabilities  can  be  computed  and  seamlessly  incorporated 
into  the  planning  stage  without  making  any  simplifying 
assumptions. 


2.2.  Probabilistic  completeness 

Due  to  the  success  of  sampling-based  methods  in  many 
practical  planning  problems,  researchers  have  investigated 
the  theoretical  basis  for  this  success.  Flowever,  almost 
all  of  these  investigations  have  been  performed  for  algo¬ 
rithms  that  are  designed  for  planning  in  the  absence 
of  uncertainty.  The  literature  in  this  direction  falls  into 
two  categories:  path-isolation-based  methods  and  space- 
covering- based  methods. 

Path-isolation-based  analysis:  I  n  this  approach,  one  path 
is  chosen,  and  it  is  tiled  with  some  sets  such  as  e-balls 
(Kavraki  et  al .,  1998)  or  sets  with  arbitrary  shapes  but 
strictly  positive  measures  (Ladd  and  Kavraki,  2004).  Then 
the  success  probability  is  analyzed  by  investigating  the 
probability  of  sampling  in  each  of  the  sets  that  tile  the 
given  path  in  the  obstacle-free  space.  The  methods  in 
Svestka  and  Overmars  (1997),  Kavraki  etal.  (1998),  Bohlin 
(2002),  and  Ladd  and  Kavraki  (2004)  are  among  those 
that  perform  path-isolation-based  analysis  of  the  planning 
algorithm. 

Space-covering-based  analysis:  In  space-covering-based 
analysis,  an  adequate  number  of  sampled  points  needed  to 
find  a  successful  path  is  expressed  in  terms  of  a  parame¬ 
ter  e,  which  is  a  property  of  the  environment.  A  space  is 
e-good  if  every  point  in  the  state  space  can  be  connected 
to  at  least  an  e  fraction  of  the  space  using  local  planners. 
The  methods  in  Kavraki  et  al.  (1995)  and  H su  (2000)  are 
among  these. 

These  methods  were  developed  for  the  situation  where 
the  desired  result  from  the  planning  algorithm  is  a  path. 
FI  owever,  i  n  the  presence  of  uncertai  nty,  the  concept  of '  suc¬ 
cessful  path'  is  no  longer  meaningful,  because  on  a  given 
path,  different  policies  may  result  in  different  success  prob¬ 
abilities,  where  some  are  interpreted  as  successful  and  some 
are  not.  Thus,  since  the  planning  algorithm  returns  a  policy 
instead  of  a  path,  success  has  to  be  defined  for  a  policy.  This 
paper  extends  these  concepts  to  probabilistic  spaces,  that 
is,  to  sampling- based  methods  concerning  planning  under 
uncertainty.  In  Section  7,  we  define  and  formulate  the  con¬ 
cepts  of  successful  policy  and  probabilistic  completeness 
under  uncertainty  (PCUU). 


3.  M  ethod  overview  and  contributions 

The  highlights  and  contributions  of  this  paper  can  be 

divided  into  theoretical  and  practical  parts.  The  theoretical 

highlights  can  be  summarized  as  follows: 

•  Abstract  frameworks:  We  introduce  the  abstract  infor¬ 
mation  roadmap  (IRM)  framework  as  a  graph  in  the 
belief  space,  where  the  graph  nodes  are  beliefs  (rigor¬ 
ously  speaking,  small  subsets  of  the  belief  space)  and 
edges  are  local  controllers.  The  abstract  FIRM  frame¬ 
work  is  defined  as  an  IRM  where  local  controllers  are 
feedback  controllers.  These  abstract  frameworks  rely  on 
the  existence  of  an  appropriate  belief  node  sampler  and 
connector  (local  controller)  and  are  general  enough  to 
capture  any  form  of  belief.  Discussing  the  concept  of 
belief  reachability  under  feedback  controllers,  we  detail 
the  reduction  of  a  POM  DP  to  a  tractable  M  DP  on  the 
FIRM  graph,  referred  to  as  the  FIRM  MDP. 

•  SLQG-FIRM:  To  instantiate  a  FIRM,  we  need  con¬ 
crete  belief  samplers  and  connectors.  A  concrete  exam¬ 
ple  of  these  components  based  on  5LQG  controllers  is 
given  in  Section  5.  Basically,  it  is  shown  that  under  an 
SLQG  controller  the  belief  can  be  driven  into  the  e- 
neighborhood  of  the  sampled  Gaussian  beliefs  in  finite 
time,  and  thus  node  reachability  is  achieved.  In  this 
fashion,  SLQG-FIRM  addresses  the  hard  task  of  sam¬ 
pling  in  reachable  belief  space  that  is  required  in  belief- 
space  planning  (Spaan  and  Vlassis,  2005;  Pineau  et  al., 
2006;  Kurniawati  et  al.,  2010).  The  focus  of  SLQG- 
FIRM  i  s  on  ki  nemati  c  systems.  H  owever,  we  al  so  extend 
it  to  dynamical  systems  by  restricting  the  nodes  to  the 
equilibrium  space. 

•  Graph  (multi-query  roadmap)  in  belief  space:  FIRM  is 
the  first  framework  that  generates  a  graph  in  the  belief 
space  with  independent  edges.  In  other  words,  it  is  a 
multi-query  roadmap,  which  distinguishes  it  from  other 
methods  in  the  belief  space. 

•  Breaking  the  curse  of  history :  A  fundamental  contri¬ 
bution  of  FIRM  is  that  the  optimal  action  at  a  given 
node  does  not  depend  on  the  traversed  nodes,  actions, 
and  observations  prior  to  this  node,  in  other  words,  it 
is  independent  of  the  history  of  the  information  pro¬ 
cess  (see  Figure  1).  This  is  a  direct  consequence  of 
inducing  reachable  belief  nodes  using  feedback  con¬ 
trollers,  which  breaks  the  curse  of  history  in  POM  DPs. 
In  addition,  the  sampling-based  nature  of  the  method, 
borrowed  from  PRM ,  allows  us  to  ameliorate  the  curse 
of  dimensionality. 

•  Probabilistic  completeness:  Finally,  we  generalize  the 
conventional  concept  of  'probabilistic  completeness’ 
(which  is  defined  for  motion-planning  methods  in 
deterministic  environments)  to  the  concept  of  PCUU 
(which  is  defined  for  the  planners  in  the  presence  of 
uncertainty).  According  to  this  definition,  we  prove 
that  FIRM  is  a  probabilistically  complete  algorithm. 
M  oreover,  we  perform  an  analysis  on  the  absorption 
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(a)  (b) 

F  ig.  2.  The  dependence  of  col  I  i  si  on  events  on  each  other  at  d  ifferent  ti  me  steps.  x/<  i  s  the  robot  state  and  F  i  s  the  obstad  e  set  ( rectangl  es) . 
P(x^eF)  is  the  collision  probability  atthek-th  time  step.  Drawn  ellipses  are  3a  ellipses  of  Gaussian  distributions  obtained  by  Kalman 
filtering.  Also,  the  samples  in  M  onte  Carlo  simulation  are  shown  by  small  circles.  The  dark  ones  have  collided  with  obstacles  and  do  not 
get  propagated,  and  the  light  ones  are  the  safe  samples.  Although  the  overall  collision  probability  in  (a)  is  much  more  than  the  collision 
probability  in  (b),  simplified  safety  measures  based  on  the  el  I  ipse- obstacle  intersection  area  lead  to  the  same  safety  measure  in  (a)  and 
(b),  and  are  unable  to  capture  this  dependency. 


probability  of  the  local  planners  in  the  belief  space, 
which  provides  useful  general  tools  that  can  be  used  in 
analyzing  planning  methods  under  uncertainty. 

More  importantly,  FIRM  offers  a  set  of  practical  con¬ 
tributions,  which  we  believe  provides  an  important  step 
toward  utilizing  POM  DPs  as  a  practical  tool  for  robot 
motion  planning  under  uncertainty.  The  main  practical 
highlights  can  be  summarized  as  follows: 

•  Efficient  planning:  The  construction  of  FIRM  is  offline 
and  thus  online  planning  (and  replanning)  is  feasible 
and  almost  instantaneous. 

•  Robustness:  The  optimal  feedback  policy,  instead  of  a 
nominal  path,  is  computed  offline.  It  is  obtained  by  solv¬ 
ing  the  DP  problem  associated  with  the  FIRM  M  DP  on 
the  belief  graph.  As  a  result,  no  replanning  is  needed 
even  in  the  case  of  large  deviations  (or  just  local  real¬ 
time  replanning  is  sufficient),  and  the  feedback  over  the 
belief  space  can  take  care  of  deviations.  Therefore,  the 
method  is  robust  to  large  deviations.  It  is  also  less  sensi¬ 
tive  to  linearization  errors,  since  if  the  system  goes  out 
of  the  linearization  region  of  a  controller,  it  falls  into 
the  valid  linearization  region  of  some  other  controller 
(assuming  a  sufficient  number  of  FIRM  nodes)  that  can 
take  the  belief  and  drive  it  to  the  goal. 

•  Reliability  (incorporating  obstacles  in  planning):  In  the 
FIRM  framework,  collision  probabilities  can  be  com¬ 
puted,  which  leads  to  more  accurate  plans,  as  opposed 
to  simplified  collision  measures  that  may  lead  to  con¬ 
servative  plans  (see  Figure  2).  The  obstacles  add  a  fail¬ 
ure  node  to  the  FIRM  graph  into  which  the  robot  can 
be  absorbed.  Further,  due  to  the  offline  construction  of 
FIRM,  the  heavy  computational  burden  of  estimating 
collision  probabilities  can  be  done  offline. 

•  Scalability:  Belief-space  planners  usually  have  an  expo¬ 
nential  planning  complexity  either  in  the  number  of 
nodes  (if  they  are  sampling-based  methods)  or  in  the 


Vi,  (ffliNSUt  liolttc] 


Fig.  3.  Block  diagram  corresponding  to  the  problem  of  planning 
under  motion  and  sensing  uncertainty. 


size  of  grid  (if  they  rely  on  discretizing  the  environ¬ 
ment).  However,  the  complexity  of  the  FIRM  construc¬ 
tion  is  a  constant  multiplier  of  the  complexity  of  the 
PRM  construction.  Moreover,  the  complexity  of  plan¬ 
ning  (or  replanning)  with  FIRM  is  a  constant,  which  is 
independent  of  the  size  of  the  underlying  graph. 

4.  Problem  formulation 

The  main  sources  of  uncertainty  in  motion  planning  are  the 
lack  of  exact  knowledge  of  the  robot's  motion  model,  the 
robot's  sensing  model,  and  the  environment  model,  which 
are  referred  to  as  motion  uncertainty,  sensing  uncertainty, 
and  map  uncertainty,  respectively.  In  this  paper,  we  focus 
on  motion  and  sensing  uncertainty,  but  some  of  the  con¬ 
cepts  are  extensible  to  problems  with  map  uncertainty.  The 
M  DP  problem  and  the  POM  DP  problem  are  the  most  gen¬ 
eral  formulations,  respectively,  for  planning  problems  under 
motion  uncertainty  and  for  planning  problems  under  both 
motion  and  sensing  uncertainty. 

While  in  the  deterministic  setting,  we  seek  an  optimal 
path  as  the  solution  of  the  motion-planning  problem,  in  the 
stochastic  setting,  we  seek  an  optimal  feedback  (mapping) 
it  as  the  solution  of  the  motion-planning  problem.  In  the 
case  of  an  M  D  P,  n  is  a  mapping  from  the  state  space  to  the 
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control  space,  while  in  the  case  of  POM  DP,  ;r  isa  mapping 
from  the  belief  space  to  the  control  space  (see  Figure  3).  In 
the  rest  of  paper,  we  focus  on  PO  M  D  Ps. 


4.1.  Preliminaries  and  notation 

As  mentioned,  the  POM  DP  formulation  is  the  most  gen¬ 
eral  formulation  for  the  planning  problem  under  pro¬ 
cess  (motion)  uncertainty  and  imperfect  state  information 
(sensing  uncertainty).  POM  DPs  are  introduced  in  Astrom 
(1965),  Smallwood  and  Sondik  (1973),  and  Kaelbling  et  al. 
(1998).  In  the  following,  we  first  explain  different  elements 
of  the  POM  DP  problem,  and  then  present  a  form  of  the 
POM  DP  formulation  which  is  known  as  the  belief  M  DP 
problem  (Thrun  et  al.,  2005;  Bertsekas,  2007). 

State,  control,  and  observation:  Letx*;  e  X,  uk  e  U, 
and  zk  e  Z  denote  the  system  state,  control,  and  observa¬ 
tion  at  time  step  k,  respectively,  where  X  c  u  c  Rd", 
and  Z  c  are  the  state,  control,  and  observation  spaces. 
Scalars  dx,  du,  and  dz  are  the  state,  control,  and  observation 
dimensions  and  Rd  denotes  the  d-dimensional  Euclidean 
space. 

Basically,  system  state  x  encodes  all  information  needed 
for  decision-making  at  a  specific  time  instant.  It  is  worth 
noting  that  the  state  space  in  our  problem  is  continu¬ 
ous.  Control  space  U,  which  contains  all  possible  control 
inputs  (or  actions),  can  also  be  continuous,  and  u o±  := 
{u0,u i,...,uk}  denotes  the  control  history  up  to  step  k. 
Similarly,  the  observation  space  Z  that  contains  all  possible 
observations  (sensor  measurements)  can  also  be  continu¬ 
ous,  and  z0:k  :=  (z0,zi, . . .  ,zk)  is  the  observation  history  up 
to  step  k. 

State  evolution  model:  The  process  model  (or  the  motion 
model)  Xfc+i  =  f(xk,uk,wk)  describes  how  the  system  state 
evolves  as  a  function  of  the  applied  control  uk  and  the  pro¬ 
cess  (motion)  noise  wk,  which  is  distributed  according  to  the 
(known)  probability  density  function  (pdf)  p(  wUx^Ui;)-  An 
equivalent  representation  of  this  evolution  model  is  through 
the  transition  pdf  p(x'|x,  u) :  X  x  U  x  X  ^  K>0,  which 
encodes  the  probability  density  of  the  transition  from  state 
x  to  state  x!  under  the  control  u. 

Observation  (sensor)  model:  Although  xk  is  sufficient 
information  to  make  the  decision  (generate  control  uk),  in 
partially  observable  systems,  the  system  state  is  unknown 
and  the  only  available  data  for  decision-making  is  the 
imperfect  measurements  of  the  state  made  by  the  sensors. 
The  observation  model  zk  =  h(  xk,  vk)  encodes  the  relation 
between  system  state  xk  and  its  measurements  zk,  where  vk 
is  the  observation  noise  at  time  step  k,  which  is  distributed 
according  to  the  (known)  pdf  p(  v^|x^).  A  n  equivalent  repre¬ 
sentation  of  this  observation  model  is  through  the  likelihood 
pdf  p(z|x)  :XxZ->  R>0. 

Information  state  (belief):  In  partially  observable  envi¬ 
ronments,  the  available  data  for  decision-making  in  time 
step  k  is  the  history  of  observations  we  have  made,  z0:k,  and 
the  history  of  actions  we  have  taken,  u0:k-i.  Let  us  denote 


this  data  history  by  Hk  =  {z0:/c, il-  This  data  history 
can  be  compressed  to  a  conditional  probability  distribution 
over  all  possible  states,  that  is,  bk  =  p(xk\z0±-,  u0:k-i).  The 
pdf  bk  :  X  x  Zk  x  U^-1  M>o  is  called  the  information 
state  or  belief  at  the  kth  step.  B  denotes  the  belief  space  of 
the  problem,  containing  all  possible  beliefsb  e  B. 

Belief  evolution  model  (filter  model):  In  recursive  state 
estimation  techniques,  belief  can  be  computed  recursively. 
The  belief  evolution  model  (or  belief  dynamics)  introduced 
by  this  recursion  is  shown  by  function  r  :  B  x  U  x  Z  ->• 
B,  which  computes  the  next  belief  based  on  the  last  action 
and  current  observation  bk+i  =  z(bk,uk,zk+i).  This  belief 
evolution  model  can  be  derived  using  Bayes'  rule  and  the 
law  of  total  probability  (Thrun  etal.,  2005;  Bertsekas,  2007) 
as  follows: 

bk+ 1  =  p(zk+i\Hk,uk)-1p(zk+i\xk+i) 

p(xk+i\xk,uk)bkdxk  =:  r(bk,uk,zk+1)  (1) 

An  equivalent  representation  of  the  belief  evolution  model 
is  through  the  transition  pdf  p(  b'\b,  u):Bxtlxl^  R>0 
that  encodes  the  probability  density  of  the  transition  from 
belief  b  to  belief  b'  under  the  control  u. 

Policy:  In  a  partially  observable  system,  the  planner  n 
(also  called  the  policy  or  feedback  controller)  has  to  be  a 
function  that  returns  an  action  uk  given  the  available  data 
Uk.  However,  it  can  be  shown  that  the  compression  of  data 
Uk  to  belief  bk  preserves  all  the  information  needed  for 
decision-making  (Kumar  and  Varaiya,  1986).  Therefore,  a 
policy  7 r(-)  has  to  be  a  function  that  returns  an  action  uk 
given  the  belief  bk,  in  other  words,  7r(  ) :  B  U: 

uk  =  tt(  bk) ,  Vbk  e  B  (2) 

The  space  of  all  possible  7r(  )  is  denoted  by  n. 

Cost-to-go:  To  choose  an  optimal  policy,  we  need  to  have 
a  cost  function,  which  is  a  task-dependent  quantity.  But  let 
us  in  general  denote  the  one-step  cost  of  taking  action  u  at 
belief  b  by  c(  b,  u) :  B  x  U  M>0.  Then,  we  can  define  the 
cost-to-go  function  J  *(•) :  B  ->■  M>0  from  a  belief  bo  under 
the  policy  n  as 

oo 

r(t>o):=I>[c(^,7r(M)] 

k=0 

s.t.  bk+i  =  T(bk,n(bk),zk+i),  Zj;  ~ p( zk\xk)  (3) 

where  E[-]  is  the  expectation  operator.  Consider  a  goal 
region  Bgoal  c  B  such  that,  for  all  u,  we  have  c(b  e 
Bgoal,u)=  0;  in  other  words,  the  goal  region  is  cost¬ 
absorbing.  Then,  the  above  cost-to-go  would  be  finite  for 
a  policy  that  can  drive  the  state  to  the  goal  region  in  finite 
time. 

4.2.  POM  DP  problem 

Given  the  motion  model  f,  observation  model  h,  and  cost- 
to-go  J 7T,  the  POM  DP  problem  seeks  the  best  policy  that 
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minimizes  the  cost-to-go  function  from  every  belief  in  the 
belief  space.  Formally,  if  we  denote  the  optimal  cost-to-go 
function  by  J  (•),  we  can  define  optimal  policy  n(-) :  B 
U,  which  is  the  solution  of  POM  DP  as  follows: 

J  (•)  :=  minj  *(■)  (4) 

n 

n  —  arg  mm]  "[•)  (5) 

This  formulation  of  the  POM  DP  problem  is  also  known 
asth ebelief-MDP  problem  (Thrun  etal.,  2005;  Bertsekas, 
2007),  because  it  is  an  M  DP  over  the  belief  space. 

Dynamic  programming:  It  is  well  known  that  the  optimal 
cost-to-go  is  obtained  by  solving  the  following  stationary 
DP  equation  on  the  belief  space!  (Thrun  etal.,  2005;  Bert¬ 
sekas,  2007).  Subsequently,  the  solution  of  the  POM  DP  (i.e. 
tt)  can  be  computed  as  a  function  that  returns  the  argument 
of  this  minimization,  that  is,  returns  the  optimal  action  at 
every  belief: 

J  (b)  =  mi n{c( b, u) -J p(b'\b,u)J  (b')  db'},vb  e  B  (6a) 

u  J  i 

n(b)  =  arg  min{c(  b,  u)  +  f  p(b'\b,u)J  (b1)  db'},  vb  e  B 

(6b) 

However,  it  is  well  known  that  this  DP  equation  is  exceed¬ 
ingly  difficult  to  solve  since  it  is  defined  over  the  entire 
belief  space  and  suffers  from  the  curse  of  history  (Pineau 
etal.,  2003)  and  the  curse  of  dimensionality. 

Constrained  POM  DP  problems:  The  presence  of  con¬ 
straints  makes  this  problem  even  more  difficult.  We  denote 
the  constraint  set  (or  the  failure  set)  in  the  state  and  control 
space  by  F  c  X  x  TJ,  which  needs  to  be  avoided  by  the 
system,  in  other  words,  (x^uU  £  F ,  for  all  k. 

4.3.  Problem  description 

We  aim  at  constructing  a  sampling-based  solution  to  the 
belief  M  DP  problem.  The  main  goals  of  this  paper  are  as 
follows. 

SLQG-based  FIRM  :  First,  we  construct  a  roadmap  in 
belief  space  utilizing  SLQG  controllers  as  belief  stabilizers. 
We  perform  this  construction  for  a  certain  class  of  systems, 
and  show  that  the  belief  reachability  condition  is  guaran¬ 
teed.  In  designing  SLQG-FIRM  ,  we  first  focus  on  kinematic 
systems  (satisfying  x  =  f(x,  0,0)).  Then,  using  the  notion 
of  equilibrium  space  and  restricting  the  sampling  to  this 
space,  we  apply  the  method  to  dynamical  systems  as  well. 

General  FIRM  framework:  After  studying  the  concrete 
SLQG-FIRM  example,  we  consider  the  more  general  case, 
where,  for  a  general  system,  assuming  that  there  exists  a 
controller  under  which  belief  reachability  is  guaranteed,  we 
(i)  construct  a  graph  in  the  belief  space  encoding  the  fail¬ 
ure  probabilities  on  its  edges,  (ii)  reduce  the  intractable 
belief  MDP  in  Equation  (4)  into  a  tractable  MDP  prob¬ 
lem  on  this  graph,  and  (iii)  compute  a  feedback  solution  on 
this  graph. 


5.  SLQG-FIRM 

I  n  this  section,  we  discuss  a  particular  instance  of  the  F I R  M 
framework  in  which  belief  reachability  is  accomplished  by 
SLQG  controllers.  In  Section  6,  we  propose  the  general 
FIRM  framework. 

We  start  this  section  by  restricting  our  attention  to  the 
class  of  systems  that  SLQG-FIRM  can  handle.  Then,  we 
present  a  brief  review  of  LQG  controllers  and  address  how 
we  can  define  nodes  in  the  belief  space  to  satisfy  reachabil¬ 
ity  using  SLQG  controllers.  Next  we  explain  the  procedure 
of  constructing  local  controllers  (i.e.  FIRM  edges)  and  the 
SLQG-based  FIRM  graph.  Finally,  we  compute  transition 
probabilities  and  costs  associated  with  each  graph  edge  and 
compute  the  graph  feedback. 

5.1.  Preliminaries  on  SLQG 

In  this  section,  we  assume  the  noise  is  Gaussian,  and 
we  start  by  defining  the  notation  needed  in  dealing  with 
Gaussian  beliefs. 

Gaussian  belief  space:  We  denote  the  random  estima¬ 
tion  vector  by  x+,  whose  distribution  is  bk  =  p(x£)  = 
pUfclzo-fc.uo-fc-i),  and  denote  the  mean  and  covariance  of 
x+  by  x+  =  E[x+]  and  P  =  E[(x+  -x+)(x+  -x+)T], 
respectively.  Denoting  the  Gaussian  belief  space  by  GB, 
every  function  b(-)  e  GB  can  be  characterized  by  a  mean- 
covariance  pair  (3+,  P ).  A  busing  the  notation,  we  also  show 
this  pair  by  b  =  (x+,P) e  Rn  x  S|,  where  the  mean  vec¬ 
tor  belongs  to  the  n-dimensional  Euclidean  space  Rn  and 
the  covariance  matrix  belongs  to  the  space  of  all  positive 
semi-definite  n  x  n  matrices  §" . 

LOG  controllers:  An  LQG  controller  is  composed  of 
a  Kalman  filter  (KF)  as  the  state  estimator  and  a  linear 
quadratic  regulator  (LQR)  controller  (see  Figure  3).  Thus, 
the  belief  dynamic  bk+i  =  r(bk,Uk,Zk+i)  is  known  and 
comes  from  the  Kalman  filtering  equations,  and  the  con¬ 
troller  Uk  =  p,(bk)  that  acts  on  the  belief  comes  from  the 
LQR  equations.  Considering  a  quadratic  cost  for  state  error 
and  control  error,  LQG  is  an  optimal  controller  for  linear 
systems  with  Gaussian  noise  (Bertsekas,  2007).  However, 
it  is  also  often  used  for  stabilization  of  nonlinear  systems 
around  a  given  trajectory  or  around  a  given  point. 

Stationary  and  time-varying  LQG :  Time-varying  LQG  is 
designed  to  track  a  given  trajectory,  in  which  at  every  time 
step  a  different  feedback  policy  is  utilized.  SLQG  is  a  time- 
invariant  policy,  in  which  LQG  is  designed  around  a  given 
point,  say  v,  to  steer  the  state  of  the  system  to  v  (Bertsekas, 
2007).  In  Appendices  B  and  C  we  review  these  controllers 
in  detail. 

Equilibrium  space:  Let  us  denote  a  configuration  of  a 
robotic  system  (Lozano-Perez,  1983)  by  q.  Kinematic  mod¬ 
els  are  specified  in  terms  of  the  configuration  variable  q, 
while  dynamical  models  are  specified  by  the  state  x  = 
(q,q),  where  q  denotes  the  corresponding  velocities.  In 
SLQG-FIRM  ,  we  sample  the  underlying  PRM  nodes  (sta¬ 
bilizer  parameters)  from  the  configuration  space.  Thus,  for 
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dynamical  systems,  we  impose  the  condition  q  =  0  on  the 
samples,  in  other  words,  we  sample  from  the  equilibrium 
space  of  the  system,  which  is  denoted  by  X  in  this  paper. 

Remark  1.  FIRM  can  be  generalized  to  cases  that  do  not 
need  to  sample  in  equilibrium  space.  For  example,  in  sys¬ 
tems  such  as  fixed-wing  aircraft,  the  system  cannot  reach 
the  zero  velocity  q  =  0.  In  such  cases,  SLQG  is  not  a 
suitable  choice  and  one  needs  to  design  more  appropri¬ 
ate  controllers,  such  as  periodic  controllers  as  detailed  in 
Agha-mohammadi  et  al.  (2012c,  2013a).  In  such  a  case, 
we  sample  periodic  maneuvers  as  FIRM  nodes.  In  other 
words,  we  go  from  periodic  trajectory  to  periodic  trajec¬ 
tory  instead  of  going  from  point  to  point  ( Agha-mohammadi 
et  al.,  2012c,  2013a). 


where  Ss  is  the  solution  of  the  following  discrete  algebraic 
Riccati  equation  (DARE): 

Ss  =  Wx  +  A]SsAs  -  Aj.S’sB,(  bJ.S'sB,.  +  W„)_1  Bj.S’sAs 

(12) 

Controllable  and  observable  pairs:  Consider  an  n  x 
n  matrix  A.  A  pair  of  matrices  (A,B)  is  called  a 
controllable  pair  if  the  controllability  matrix  £  = 

[B,  AB,  A2B, . . . ,  AW_1B]  has  rank  n  (Bertsekas,  2007).  A 
pair  of  matrices  (A, H)  is  called  observable  if  the  pair 
(  At,Ht)  is  controllable  (Bertsekas,  2007). 

Controllable  and  observable  systems:  Let  us  also  define 
the  matrices  Q  and  Wx  such  that  GQGT  =  QQT,  Wx  = 
WjWx.  We  next  consider  a  class  of  linear  systems  and 
quadratic  cost  weights  that  satisfy  the  following  property. 


5.2.  Belief  stabilizers 

In  SLQG-FIRM  nodes,  we  use  SLQG  controllers  as  belief 
stabilizers,  that  is,  as  a  tool  to  reach  (stabilize  to)  a  prede¬ 
fined  belief  (FIRM  node).  To  explain  how  SLQG  works  as 
a  belief  stabilizer,  consider  a  fixed  point  v  e  X  in  the  state 
space  and  consider  the  following  linear  (linearized)  system 
about  v: 

xM=  Axk  +  B uk  +  G wk,  wk  ~  J\f(  0,  Q)  (7a) 

zk=  Uxk  +  vk,  vk  ~  Af(  0,  R)  (7b) 


Property  1.  Pairs  ( A,  B)  and  ( A,  Q)  are  controllable  pairs, 
and  pairs  ( A,  H)  and  ( A,  W)  are  observable  pairs. 

In  the  following,  we  present  three  lemmas,  through  which 
we  can  construct  reachable  SLQG-FIRM  nodes  for  the  sys¬ 
tems  that  satisfy  Property  1.  However,  approaches  such 
as  periodic  LQG  (PLQG)-based  FIRM  (Agha-mohammadi 
et  al.,  2012c)  or  dynamic  feedback  linearization  (DFL)- 
based  FIRM  (Agha-mohammadi  et  al.,  2012a)  extend  this 
class  of  systems  by  excluding  the  controllability  part  in 
Property  1,  and  thus  consider  a  broader  class  of  systems. 


SLQG  controller:  The  goal  of  the  SLQG  controller 
designed  about  v  is  to  keep  the  state  as  close  as  possible 
to  the  desired  point  v  and  also  keep  the  energy  consumed  at 
a  reasonable  level.  More  rigorously,  SLQG  minimizes  the 
following  quadratic  cost: 

J  =  E  I  F3Xk  -  V)T  _  v)  i  (8) 

n>o  j 


where  Wx  and  WM  are  positive-definite  weight  matrices  that 
are  defined  by  the  user.  In  Appendix  C,  the  SLQG  controller 
minimizing  the  above  cost  is  discussed  in  detail.  However, 
in  brief,  the  belief  propagation  and  control  generation  is 
carried  out  as  follows: 


bk+ 1  = 


A xf  +  B uk  +  Kk+i(zk+i  -  H(  +  B uk)) 

(/  -  K*+1H)  (  AP+At  +  GQGt) 


—  T  (  bk ,  Uk ,  Zk+ 1 ) 


(9) 


where  is  called  the  Kalman  gain  at  the  kth  time  step  and 
is  computed  as  follows: 


K*+1  =  (  AP+At  +  GQGt)Ht(H(  AP+At 

+  GQGt)  Ht  +  MRMt)“'  (10) 


Lemma  1.  Consider  the  SLQG  controller  designed  to  drive 
the  state  of  the  system  in  Equation  (7)  to  a  point  v  e  X. 
Given  that  Property  1  is  satisfied,  in  the  absence  of  a  stop¬ 
ping  region,  the  belief  bk  under  SLQG  controller  converges 
to  a  unique  stationary  belief  bs,  in  distribution  (i.d.).  In 
other  words,  the  distribution  over  belief  converges  to  a 
unique  distribution.  That  is, 

bk^bs~AT(bc,  c)  (13) 

Note  that  bk  is  a  random  belief  that  converges  to  another 
random  belief  bs.  In  the  Gaussian  setting,  the  distribution 
over  the  random  belief  bs  is  A f(  bc ,  C),  where  bc  =  E[^]  = 
(\,PS).  The  stationary  estimation  covariance  matrix  Ps 
is  characterized  in  Lemma  2,  and  the  covariance  C  is 
characterized  in  Appendix  C. 

Proof.  In  Appendix  C,  we  review  the  SLQG  and  prove 
Lemma  1.  □ 

Lemma  2.  Given  Property  1,  the  following  DARE  has 
a  unique  symmetric  positive -definite  solution  ( Bertsekas , 
2007),  denoted  by  Pf: 

p;  =  gqgt  +  A(p;  -  p;ht(H/>;ht  +  Rr1  up;) at 

(14) 


The  control  signal  is  generated  using  a  stationary  feedback 
gain  Ls: 


uk  =  — L,(x+  —  v)  = :  ii( bk) ,  L,  =  (BjS^+WJ-1  BjS’,A.5 

(ID 


Moreover,  the  stationary  covariance  matrix  Ps  introduced 
in  Lemma  1  is  computed  as: 

ps  =  p;  -  p;ht(  hp;ht  +  R)-1  up;  (15) 
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Proof  See  Appendix  C  or  Bertsekas  (2007).  □ 

Now  we  state  the  main  result,  through  which  we  can 
construct  reachable  FIRM  nodes  under  SLQG-based  belief 
stabilizers: 

Lemma  3.  Consider  the  SLQG  controller  designed  to  drive 
the  state  of  the  system  in  Equation  (7)  to  a  point  v  e  X. 
Suppose  matrix  H  is  full  rank  and  Property  1  is  satisfied. 
Then,  any  set  B  C  1  whose  interior  contains  bc  =  ( v,  Ps) 
is  reachable  under  the  designed  SLQG  controller  start¬ 
ing  from  any  Gaussian  distribution.  Moreover,  the  estima¬ 
tion  covariance  Pk  converges  to  the  unique  deterministic 
stationary  covariance  Ps. 

Proof.  See  Appendix  D.  □ 

Therefore,  based  on  Lemma  3,  SLQG  can  accomplish  the 
belief  reachability  for  appropriately  chosen  region  B.  In  the 
next  subsection  we  explicitly  characterize  region  B. 

5.3.  Designing  SLQG-FIRM  nodes 

Underlying  PRM:  As  mentioned,  to  construct  a  FIRM  we 
first  construct  an  underlying  PRM  (Kavraki  et  al.,  1996).  In 
the  SLQG-FIRM,  nodes  of  the  underlying  PRM,  denoted 
by  {V}^,  are  sampled  from  the  obstacle-free  space.  Con¬ 
sidering  linear  systems  or  nonlinear  systems  that  are  locally 
well  approximated  by  linearization,  we  linearize  the  system 
about  every  PRM  node.  Let  us  denote  the  linear  (linearized) 
system  about  v7  as  follows: 

Xk+ 1=  A jXk  +  B iuk  +  Qwk,  Wk  ~  Af(  0,  Q7)  (16a) 

Zk—  H 'xk  +  vk,  vk  ~  Af(  0,  R;)  (16b) 

where  Wk  and  Vk  are  motion  and  measurement  noise,  respec¬ 
tively,  drawn  from  zero-mean  Gaussian  distributions  with 
covariances  Q7  and  R7  . 

FIRM  nodes:  To  design  the  jth  FIRM  node  B] ,  we  first 
design  the  SLQG  controller  p’s  (see  Equations  (9)  and  (11)) 
corresponding  to  the  system  in  Equation  (16).  The  con¬ 
troller  p’s  is  called  the  jth  node  controller  or  the  jth  belief 
stabilizer.  Given  Property  1 ,  based  on  Lemma  1 ,  the  limit¬ 
ing  random  belief  tis  =  (x^7,/^)  exists,  and  and  P’s  are 
the  stationary  estimation  mean  and  covariance,  respectively. 
Note  that  under  SLQG,  is  a  random  variable  and  P’s  is 
a  deterministic  matrix.  Moreover,  in  Lemma  1,  it  is  shown 
that  tic  =  IEfM]  =(  vfP’s),  where  P’s  is  shown  to  be  unique 
and  computed  in  Lemma  2. 

Thus,  we  can  characterize  the  jth  node  center: 

tic^(vi,P{)  (17) 

As  a  result,  considering  B P  as  a  ball  with  an  arbitrary  radius 
6  >  0  centered  at  tic ,  the  pair  ( Bi ,  pJs)  is  a  proper  pair ,  based 
on  Lemma  3;  in  other  words,  Bi  is  reachable  under  pJs.  Thus, 
one  can  define  the 7th  FIRM  node  as  if  =  {b  :  \\b  —  tic\\h< 
5},  where  ||  •  \\h  denotes  a  suitable  norm  in  belief  space  and 


8  defines  the  FIRM  node  size.  A  typical  example  of  such 
a  FIRM  node  in  Gaussian  belief  space  can  be  defined  by 
considering  mean  and  covariance  separately: 

ff  =  {b  =  (x,P)  :  ||* -vi  <8u\\P-Pjs\\m<82}  (18) 

where  8 1  and  82  are  suitably  small  thresholds  that  determine 
the  size  of  FIRM  node  if.  Moreover,  ||  •  ||  is  a  suitable  vector 
norm  and  ||  *  ||m  is  a  suitable  matrix  norm.  We  denote  the  set 
of  all  SLQG-FIRM  nodes  as  ¥  =  {Bf. 

5.4.  Designing  SLQG-FIRM  edges 

A  FIRM  edge  is  actually  a  local  planner  (local  feedback 
controller).  In  SLQG-based  FIRM,  the  local  controller  rep¬ 
resenting  the  ( ij) th  edge  is  denoted  by  /z77.  The  role  of  /z77 
is  to  drive  the  belief  from  the  node  Bl  to  the  node  if .  Based 
on  Lemma  3,  for  a  linear  system,  if  we  choose  pij  —  /4, 
as  was  done  in  Agha-mohammadi  et  al.  (201 1),  the  node  Bi 
is  reachable  under  /z77.  However,  to  better  cope  with  nonlin¬ 
earities,  we  construct  the  local  controller  /z77  by  preceding 
the  node  controller  with  a  time-varying  LQG  controller  ~pk, 
which  is  called  an  edge  controller  here.  Time- varying  LQG 
controllers  are  described  in  detail  in  Appendix  B. 

PRM  edge :  To  design  edge  controllers,  first  the 
underlying  PRM  edges,  denoted  by  £  =  {e77},  have  to  be 
constructed.  For  kinematics-based  models  there  are  many 
different  methods  in  the  PRM  literature  to  construct  such 
edges.  For  dynamical  models,  there  are  fewer  choices.  A 
few  examples  are  in  Van  den  Berg  and  Overmars  (2007) 
and  Agha-mohammadi  et  al.  (2012c). 

Edge  controllers:  An  edge  controller  jll]k  in  SLQG-FIRM 
is  built  by  linearizing  the  system  along  the  (ij) th  PRM 
edge  e77  and  designing  a  time-varying  LQG  controller  to 
track  it  (see  Appendix  B).  The  edge  controller  has  two 
major  roles.  First,  it  tries  to  track  the  PRM  edge  and  thus 
exploits  the  available  information  on  the  PRM  edges,  such 
as  some  clearance  from  the  obstacles.  Second,  in  the  case 
where  the  neighboring  PRM  nodes  are  not  close  to  each 
other,  it  takes  the  belief  into  the  valid  linearization  region 
of  the  jth  belief  stabilizer,  where  it  hands  the  system  over 
to  the  belief  stabilizer,  and  the  belief  stabilizer  in  turn  takes 
the  system  to  the  jth  FIRM  node. 

Local  controllers:  Thus,  overall,  the  (ij) th  local  con¬ 
troller  /z77  is  the  concatenation  of  the  ( /,  j)th  edge  controller 
]IlJk  and  jth  node  controller  (belief  stabilizer)  p’s.  We  denote 
the  set  of  all  SLQG-FIRM  edges  by  M  =  {/z77}  and  the  set 
of  all  SLQG-FIRM  edges  originating  from  Bl  by  M(  i). 

SLQG-FIRM:  Formally,  we  define  SLQG-FIRM  as  a 
graph  with  the  set  of  nodes  V  =  {B1}  and  the  set  of  edges 
(or  local  controllers)  M  =  {/z77}.  The  set  of  controllers 
originating  from  Bl  is  denoted  by  M(  i)  C  M. 

5.5.  Transition  probabilities  and  edge  costs 

To  find  a  feedback  on  a  FIRM  graph,  we  need  to  com¬ 
pute  the  cost  associated  with  the  graph  edges.  Moreover,  we 
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include  the  constraint  set  F  into  the  planning  with  FIRM 
by  computing  the  probability  of  violating  the  constraint 
(x,  u)$l  F  along  the  graph  edges.  Let  us  denote  the  cost 
of  taking  controller  fF  at  node  Bl  by  C8(B\  /F7).  Super¬ 
script  g  refers  to  the  ‘global’  (or  ‘graph-level’)  quantities,  as 
these  quantities  are  used  to  find  the  global  policy  (or  policy 
on  the  graph).  Similarly,  let  ¥8(ff\B\  /F7)  and  ¥8(F\B\  p}j) 
denote  the  probabilities  of  the  transition  to  B>  and  F  under 
/F7,  respectively.  These  quantities  are  rigorously  defined  in 
Section  6  and  their  connection  with  the  original  POMDP 
is  established.  However,  in  this  subsection,  we  just  give  an 
example  of  how  such  costs  and  transition  probabilities  can 
be  computed. 

Transition  probabilities :  Computing  transition  proba¬ 
bilities  ¥8(-\ B\p}i)  in  general  can  be  computationally 
expensive.  Here,  we  utilize  particle-based  methods  to 
approximate  the  distributions  and  thus  compute  the  colli¬ 
sion  probabilities.  Basically,  we  can  approximate  the  failure 
and  reachability  probabilities  based  on  the  number  of  par¬ 
ticles  that  violate  the  constraints  (hit  the  set  F)  and  based 
on  the  number  of  particles  that  can  reach  the  target  node 
(hit  the  set  B).  The  method  is  described  in  more  detail  with 
the  experiments  in  Section  8.1.4.  The  dependency  of  col¬ 
lision  events  on  each  other  in  different  time  steps,  which 
is  ignored  in  most  collision  probability  computation  meth¬ 
ods  in  the  POMDP  literature,  can  be  taken  into  account 
rigorously  in  particle-based  methods.  Owing  to  the  offline 
construction  of  FIRM,  the  high  computational  burden  of 
particle-based  approaches  can  be  tolerated.  However,  any 
other  method  for  computing  transition  probabilities  can  also 
be  adopted,  such  as  that  in  Patil  et  al.  (2012). 

Edge  costs :  The  FIRM  edge  costs  in  general  and  their 
derivation  based  on  the  one- step  costs  of  the  original 
POMDP  problem  are  defined  are  Section  6.  However, 
roughly  speaking,  we  can  define  the  cost  C8(B\  /F7)  as  the 
sum  of  all  one-step  costs  along  the  edge  until  the  system 
reaches  the  target  node  Bi  or  hits  the  failure  set  F .  Depend¬ 
ing  on  the  application,  one  can  define  a  variety  of  cost 
functions.  Here,  we  form  a  cost  function  based  on  a  linear 
combination  of  the  estimation  accuracy  and  edge  traverse 
time.  This  cost  function  aims  to  find  paths  for  which  the 
estimator  (and  hence  the  controller)  can  perform  well,  and 
also  to  find  faster  paths.  An  indicator  of  estimation  error  is 
the  trace  of  estimation  covariance.  Thus,  we  define  O'7  = 
MTl=i  tr(  Pf)  1  along  the  edge.  In  SLQG,  the  covariance 
matrix  evolves  deterministically  and  thus  the  expectation 
operator  can  be  omitted.  However,  if  the  filter  of  choice  in 
the  edge  controller  is  the  extended  Kalman  filter  (EKF),  the 
covariance  matrix  evolution  is  also  stochastic,  and  this  mea¬ 
sure  can  take  into  account  its  stochasticity.  Let  us  denote  the 
mean  stopping  time  under  controller  /F7  as  Tij .  Then,  the 
total  edge  cost  is  considered  as  a  linear  combination  of  esti¬ 
mation  accuracy  and  expected  stopping  time,  with  suitable 
coefficients  a\  and  c^: 

C8{B\p}j)=  oti<Z>ij  +  a2?ij  (19) 


5.6.  Graph  feedback  on  SLQG-FIRM 

Graph  policy :  Graph  policy  n8  :  V  — >  M  is  a  function  that 
returns  an  edge  (local  controller)  for  any  given  node  of  the 
graph.  We  denote  the  space  of  all  graph  policies  by  n-L  To 
choose  the  best  graph  policy  in  II8  we  define  the  optimal 
graph  cost-to-go  J8  from  every  graph  node. 

Graph  cost-to-go\  The  cost-to-go  from  a  given  node 
Bl  is  equal  to  the  cost  of  the  next  taken  controller,  that 
is,  C8(Bl,7t8(B1)),  plus  the  expected  cost-to-go  from  the 
next  node  or  from  the  failure  set.  In  other  words,  the  DP 
equations  for  this  graph  are 

J8(Bl)  =  mmC8{Bi,p}j)+J8{F)¥8{F\Bi,p}j) 

M(f) 

+  J8(Bj)¥8(Bi\Bi ,  p}j)  (20a) 

7rS(flf)  =  argmin  C8(B\p}j)  +J8(F)F8(F\Bi,  p}j) 

M(0 

+  J8(Bj)¥8(Bi\Bi ,  p}j)  (20b) 

in  which  J( F)  is  a  suitably  high  user-defined  cost-to-go  for 
hitting  the  obstacles.  The  cost-to-go  from  goal  node  Bgoal  is 
defined  to  be  zero,  in  other  words,  J8(Bgoal)  =  0. 

Solving  SLQG-FIRM  DP:  The  DP  in  equation  (20)  is  a 
tractable  DP  as  it  is  defined  on  a  finite  number  of  graph 
nodes.  Computing  the  transition  costs  and  probabilities 
offline,  this  DP  can  be  solved  online  using  standard  tech¬ 
niques,  such  as  value/policy  iteration  methods,  for  any  sub¬ 
mitted  query.  As  a  result,  FIRM  is  indeed  a  multi-query 
roadmap  in  belief  space.  Moreover,  if  the  goal  node  is  fixed 
and  only  the  starting  point  of  the  query  changes,  then  this 
DP  can  be  solved  offline  and  tv 8  can  be  stored  as  a  look-up 
table. 

Offline  construction  of  SLQG-FIRM  :  Algorithm  1  details 
the  construction  of  SLQG-FIRM  with  a  given  goal  node. 


5. 7.  Planning  with  SLQG-FIRM  ( query  phase ) 

Given  that  the  FIRM  graph  is  computed  offline,  the 
online  phase  of  planning  (and  replanning)  on  the  roadmap 
becomes  very  efficient,  and  thus  feasible  in  real  time.  In 
this  section,  we  assume  that  the  goal  node  is  fixed  and  we 
just  input  the  start  point  as  the  query.  However,  as  discussed 
in  the  previous  subsection,  one  can  easily  submit  queries 
with  different  goal  locations  by  solving  the  DP  online.  If 
the  initial  belief  bo  of  the  submitted  query  does  not  belong 
to  any  B\  we  create  a  singleton  set  Bo  =  {^o}  as  the  ini¬ 
tial  FIRM  node.  To  connect  Bo  to  the  FIRM  graph,  we 
go  back  into  the  state  space,  where  the  underlying  PRM  is 
constructed.  There,  we  add  a  new  PRM  node  to  the  graph 
Vo,  which  is  the  expected  value  of  the  robot  state,  in  other 
words,  Vo  =  IE[xo].  Then,  we  connect  Vo  to  the  underlying 
PRM  graph  based  on  the  connecting  function  of  the  adopted 
PRM.  We  denote  the  set  of  newly  added  edges  originating 
from  vo  by  £ ( 0).  Then,  corresponding  to  each  edge  in  £ ( 0), 
we  design  a  local  controller  and  call  the  set  of  them  M(  0). 
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Algorithm  1:  Offline  construction  of  SLQG-FIRM 


1  input :  Free  space  map,  Xfree 

2  output :  FIRM  graph  Q 

3  Sample  PRM  nodes  V  =  (v-'}Jw=1  and  construct  its  edges 

s  =  m, 

4  forall  the  PRM  nodes  vJ  e  V  do 

5  Design  the  node  controller  (SLQG)  /xys  about  the 
nodev'  using  Equation  (11); 

6  Compute  associated  bic  using  Equation  (17); 

7  Construct  FIRM  node  S'  using  Equation  (18); 


8  Construct Y  =  {S'}; 

9  forall  the  PRM  edges  e  £  do 

10  Design  the  edge  controller  (time- varying  LQG) 
along  the  edge  &  (detailed  in  A  ppendix  B); 

n  Construct  the  local  controller  /xij  by  concatenating 


12 
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14 


edge  controller  ~rfj  and  node  controller  /i‘s; 

Set  b0  =  bj.; 

Generate  sample  belief  paths  b0:r  and  ground  truth 
paths  x0:r  induced  by  controller  /i9  invoked  at  S'; 
Compute  the  transition  probabilities  P9(F  |6',  pi1) 
and  P9(B'|B', /i9)  and  transition  cost  Cg(Bi,pJi); 


is  Construct M  =  {/i9}; 

16  Compute  the  cost-to-go  J 9  and  feedback  over  the 
FIRM  nodes  by  solving  the  DP  in  Equation  (20); 

17  Q  =  (V,  MJ  9,7t9); 

is  return  Q\ 


Finally,  we  choose  the  best  initial  controller  among  the  local 
controllers  in  M(  0)  using 

Ho(-)  =  arg  min{C9(6c, p) 

lie  M(0) 

+  P9( B( n)  |60,  fi)J  g(B(^))  +P9( F  |6 o, n)J  g(F)} 

(21) 

where  6  ( /x)  is  the  target  node  of  the  control  ler  p.  U  nder  the 
controller  p*Q,  belief  evolves  and  enters  one  of  F I RM  nodes, 
if  no  collision  occurs.  From  this  FIRM  node,  a  combination 
of  the  global  graph  policy  7r9  and  the  local  edge  policies 
{/i9}  can  take  the  belief  to  the  goal  node,  as  explained 
below. 

Merging  global  and  local  feedbacks:  After  computing  a 
global  graph  feedback  ng  and  local  edge  feedbacks  [pg], 
we  can  construct  a  full  feedback  x r.  Actually,  at  every  time 
instance,  xz  is  equal  to  one  of  the  local  feedbacks,  which 
is  chosen  by  the  global  feedback  in  the  last  visited  node.  In 
other  words,  given  the  current  FIRM  node,  we  use  pol  icy  ng 
defined  on  FIRM  nodes  to  find  /X*  and  pick  p*  to  move  the 
robot  into  6 ( p*).  This  process  is  continued  until  the  system 
reaches  the  goal  region  or  hits  the  failure  set.  Algorithm  2 
illustrates  this  procedure. 

Kidnapped  robot  problem:  In  robotics,  the  kidnapped 
robot  problem  commonly  refers  to  a  situation  where  an 
autonomous  robot  in  operation  is  carried  to  an  arbitrary 


Algorithm  2:  Online  phase  algorithm  (planning  or 
replanning  with  SLQG-FIRM ) 


1  input :  Initial  belief  bo- FIRM  graphs 

2  if  3 S'  e  Y  such  that  bo  e  S'  then 

3  |  compute  piJ  =  xrg(  S'); 

4  else 

5  Compute  v0  =  E[x0]  based  on  b0,  and  connect  v0 
to  the  PRM  .  Let  £(  0)  denote  the  set  of  outgoing 
edges  from  v0; 

6  Set  60  =  {b0};  design  local  controllers  associated 
with  edges  in  £[  0).  Call  the  set  of  these  local 
controllers  M(  0); 

7  forall  the  p  e  M(0)  do 

8  Generate  sample  belief  paths  b0:r  and  ground 
truth  paths x0:r  induced  by  controller  p 
invoked  at  b0; 

9  Compute  the  transition  probabilities 
P^F  \B0,p)  and  ¥g(B(p)  |60,/i)  and 

_  transition  costs  C9(  60,  m); 
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Set  /  =  0  and  choose  the  best  initial  local  controller 
/I1  within  the  setM(  0)  using  Equation  (21); 


n  whiles'  /  Bgoal  do 

Set  B>  as  the  target  node  of  ^ ; 
while b*r  i  B‘  and  ‘no  collision'  do 

A  pply  the  control  14  =  fiiJ(  bk)  to  the  system; 
Get  the  measurement  Zk+i  from  sensors; 
if  Collision  happens  then  return  Collision; 

U pdate  belief  as  bk+1  =  r(bk,fj,iHbk),zk+1); 

Set  S'  =  S',  then  compute  /J  =  x r9(  S'); 


location  (Choset  et  al.,  2005).  Consider  a  kidnapped  robot 
problem  in  a  known  environment.  J  ust  after  the  robot  is  kid¬ 
napped  it  would  be  risky  to  apply  any  control,  because  the 
robot  may  be  close  to  an  obstacle.  Thus,  in  such  a  sce¬ 
nario,  we  first  initialize  the  system  belief  with  a  Gaussian 
with  large  covariance  and  go  into  an  'information  gather¬ 
ing'  mode,  where  we  do  not  apply  any  control  signal  and 
only  gather  measurements  until  the  covariance  shrinks  to  a 
reasonable  covariance  or  it  remains  unchanged  for  a  sig¬ 
nificant  amount  of  time  (i.e.  when  there  is  no  additional 
i  nformation  to  reduce  the  uncertai  nty).  A  fterwards,  we  con¬ 
nect  the  resulting  belief  to  the  FIRM  nodes  and  continue 
apply i  ng  the  F I R  M  pol  i cy  to  move  the  robot  toward  the  goal 
region.  A  more  efficient  approach  of  handling  this  prob¬ 
lem  is  detailed  in  Agha-mohammadi  et  al.  (2013c)  using 
innovation  signals. 


6.  General  FIRM  framework 

The  goal  of  this  section  is  to  construct  a  general  FIRM 
framework,  assuming  that  there  exists  a  mechanism  to  guar¬ 
antee  belief  reachability.  Asa  result,  if  for  a  certain  class  of 
systems  one  comes  up  with  a  control  ler  that  can  accomplish 
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belief  reachability,  a  graph  in  belief  space  directly  follows 
according  to  this  general  framework. 

To  construct  the  general  FIRM ,  we  start  by  defining  ele¬ 
ments  and  assumptions  needed  in  the  FIRM  construction. 
Accordingly,  we  transform  the  original  intractable  POM  DP 
problem  into  an  SMDP  problem  in  belief  space,  inspired 
by  sampling-based  methods.  Then,  we  construct  an  arbitrar¬ 
ily  good  approximation  to  the  solution  of  this  belief  SMDP 
over  finite  subsets  of  belief  space  (FIRM  nodes).  Doing  so, 
we  end  up  with  a  tractable  M  DP,  the  so-called  FIRM  M  DP. 
We  discuss  this  derivation  first  for  the  obstacle-free  case  and 
then  we  add  the  obstacles  to  the  planning  framework.  We 
characterize  the  quality  of  the  solution  obtained  by  FIRM 
via  its  success  probability  and  provide  a  generic  algorithm 
for  planning  with  FIRM  . 

6.1.  Feedback  controllers  and  reachability 

Belief  transition  probability:  As  discussed  in  Section  4, 
in  partially  observable  environments  the  available  data  for 
decision-making  at  time  step  k  can  be  compressed  as  the 
information  state  or  belief  bk.  As  discussed,  using  dynamic 
estimation  schemes,  belief  can  be  propagated  as  bk+i  = 
r(bk,Uk,Zk+i)  (see  Equation  (1)),  which  can  be  presented  as 
a  one-step  transition  pdf  p(bk+i\bk,Uk)  or  a  one-step  tran¬ 
sition  probability  P(B|bfc,i4)=  fB  p(bk+i\bkluk),  where 
Bel. 

Feedback  controllers  and  induced  transition  probabil¬ 
ity:  In  partially  observable  environments,  at  each  stage,  the 
decision-making  process  is  performed  based  on  the  belief 
at  that  stage.  Therefore,  a  controller  is  a  mapping  from 
the  belief  space  to  the  control  space,  in  other  words,  p(-) : 
B  -»•  U.  Accordingly,  a  controller  p  induces  a  Markov 
chain  with  the  one-step  transition  probability  Pi(  B \b,  pi)  := 
P(  B  | b,  p{  b) )  over  the  belief  space. 

Hitting  time:  Let  T(  D  \b,p)  e  [0,  oo]  denote  the  hitting 
time  on  the  set  D  c  B  under  the  controller  p  starting  from 
belief  b.  Formally  it  is  defined  as 

T(  D  |b,  pi)  :=  min{k  >  0,  bk  e  D  |b0  =  b,  p)  (22) 

Stopping  region :  We  call  region  B  c  B  a  stopping  region 
of  the  controller  p  if  we  force  the  controller  to  stop  execut¬ 
ing  as  the  belief  reaches  the  region  B;  in  other  words,  for  all 
b  e  B,  we  impose  Pi(  B \b,  pi)  =  1. 

n-step  transition  probability:  We  define  the  n-step  transi¬ 
tion  probability  as  the  probability  of  landing  in  the  stopping 
region  B  in  at  most  n  steps: 

Pn(B\b,p):=  Pr(T(B\b,p)<n)  (23) 

Stationary  transition  probability:  Consider  the  controller 
pi  that  starts  executing  from  belief  b  and  stops  execut¬ 
ing  when  the  belief  enters  region  B.  Thus,  we  can  define 
P(B\b,p)  as  the  transition  probability  from  b  to  B  induced 
by  pc,  when  the  controller  stops  executing;  in  other  words, 


V(B\b,pi)  would  be  the  probability  of  landing  in  the  stop¬ 
ping  region  B  in  a  finite  amount  of  time: 

¥(B\b,p)  :=Pr(T(B\b,p)  <  oo)  (24) 

Reachability  and  accessibility:  The  stopping  region  B 
is  called  reachable  under  a  controller  p  starting  from  b  if 
V(B\b,p)  =  1.  The  stopping  regions  is  called  accessible 
under  a  controller  p  from  b,  if  P(  B  | b,  p)  >  0. 

aT-reachability:  The  stopping  region  B  is  called 
aT-reachable  under  a  controller  p  starting  from  b  if 
PT(B\b,p)=  Pr(T(B\b, p)  <  T)>  a,  in  other  words,  if 
the  controller  can  drive  the  system  into  B  in  fewer  than  T 
steps  with  a  probability  greater  than  a. 

Reachability  basin:  The  reachability  basin  B  associated 
with  the  pair  (p,B)  is  the  set  of  all  beliefs  from  which 
B  is  reachable  under  /x  in  the  absence  of  constraints.  The 
reachability  (and  aT-reachability)  basins  are  thus  defined 
respectively  as  follows: 

8  =  {beB  :  P(B\b,p)  =  1}  (25) 

B(a,T)  =  (b  e  B  :  PrfBlb,^)  >  a)  (26) 

Clearly,  B  c  B,  and  in  practical  cases,  B  is  much  smaller 
than  B . 

6.2.  FIRM  graph 

In  this  section,  we  assume  that  there  are  no  constraints  (i.e. 
F  =  0),  and  we  reduce  planning  over  the  entire  belief  space 
to  planning  over  a  representative  graph  constructed  within 
the  belief  space.  Doing  so,  we  can  reduce  the  M  DP  prob¬ 
lem  in  (4)  over  the  continuous  space  into  a  tractable  M  DP 
problem  defined  over  the  graph  nodes. 

Stabilizer  sampling :  The  first  step  in  the  construction 
of  the  proposed  framework  is  to  sample  a  set  of  stabiliz¬ 
ers  {pi},  where  each  stabilizer  p(-)  is  a  mapping  from  the 
belief  space  to  the  control  space.  Typically,  every  stabilizer 
is  characterized  by  a  dv-vector  of  parameters  v  e  Rd|/;  in 
other  words,  we  can  denote  the  jth  stabilizer  more  rigor¬ 
ously  as  pi(  •;  V) :  B  -»•  U.  As  a  result,  we  can  sample  the 
parameters  V  =  {v'}  and  then  construct  a  stabilizer  corre¬ 
sponding  to  each  parameter.  One  can  view  the  set  V  as  a  set 
of  underlying  PRM  nodes  in  the  parameter  space. 

Sampling  FIRM  nodes:  FIRM  nodes  (B;)  are  disjoint  sets 
in  the  belief  space,  where  the/th  node  has  to  be  chosen  such 
that  it  is  reachable  under  the  /-stabilizer;  in  other  words, 
P(  Bi\b,pi)=  1,  with  a  sufficiently  large  B.  We  discuss  the 
sizeofB  further  below.  Note  that,  for  practical  purposes,  the 
reachability  condition  can  be  replaced  by  aT -reachability  if 
needed. 

Connecting  samples:  Consider  a  set  of  N  samples 
{(/*',  B'HfU,  where  the  reachability  basin  of  the  /th  sam¬ 
ple  is  denoted  by  B1.  Now,  consider  {B'}f=1  as  the  nodes  of 
a  graph.  The  node  B'  is  connected  to  the  node  B>  if,  starting 
from  any  b  e  B',\Ne  can  reach  Bi  using  pi.  In  other  words 
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Bl  is  connected  to  the  node  B>  if  Bl  c  &.  Again,  the  reach¬ 
ability  condition  can  be  replaced  by  the  aT-reachability 
condition. 

Checking  connection  condition :  For  simple  systems  (lin¬ 
ear  with  Gaussian  noise)  and  some  controllers  (such  as 
SLQG),  the  connection  condition  can  be  checked  analyti¬ 
cally.  However,  in  general,  checking  this  connection  con¬ 
dition  analytically  may  be  very  difficult.  In  such  cases, 
the  Markov  chain  induced  by  the  controller  can  be  sim¬ 
ulated  numerically  (e.g.  using  particle-based  methods). 
Accordingly,  we  can  approximate  the  reachability  (or  aT- 
reachability)  probability  and  check  if  the  condition  is  true 
or  not.  Since  this  process  is  done  offline,  the  computational 
burden  can  be  tolerated.  However,  as  we  will  see  further 
below,  in  many  cases  designing  suitable  edge  controllers 
in  practice  increases  the  reachability  probability  such  that 
practically  one  can  assume  the  reachability  is  satisfied,  and 
so  there  is  no  need  to  propagate  the  probability  distribution. 

Stopping  region:  By  definition,  the  graph  node  B  associ¬ 
ated  with  the  controller  p  acts  as  the  stopping  region  of  the 
controller.  However,  if  the  process  under  the  stabilizer  hits 
another  graph  node  before  its  corresponding  graph  node, 
we  can  stop  the  controller  and  pick  the  best  controller  from 
this  intermediate  node.  Therefore,  we  can  extend  the  stop¬ 
ping  region  for  all  controllers  to  the  union  of  all  nodes 
:=  uf=1Bl.  As  a  result,  we  will  not  necessarily  have 
¥(Bl\b ,  pl)  =  1  since  the  process  may  hit  some  other  node 
before  Bl.  However,  we  will  have  P(  \b,  pl)  =  1  for  all  i  in 
the  absence  of  constraints. 

Local  controllers  ( simplified  connecting  strategy):  To 
ease  the  connection  step,  and  to  have  more  distant  nodes, 
we  can  precede  each  stabilizer  by  a  time- varying  controller 
(referred  to  as  the  edge  controller).  To  illustrate  this  idea, 
consider  two  nodes  Bl  and  B> ,  where  Bl  <£.  Lf\  that  is,  Bl 
cannot  be  connected  to  Bf  through  pf  In  this  case,  we  can 
connect  the  underlying  state  nodes  v*  and  in  the  state 
space  by  a  finite  trajectory  eij  (say  of  length  i)  and  then 
design  a  time- varying  controller  ~pf  for  k  =  0, 1,  to 
track  this  finite  trajectory.  Therefore,  if  the  node  Bl  is  in  the 
basin  of  reachability  of  the  pair  (plf&),  then  obviously  Bl 
would  be  in  the  basin  of  reachability  of  the  pair  (pV,&)9 
where  the  controller  pij  =  {plQ.L,  pj}.  We  call  p^  the  ( /  J)th 
local  controller,  as  it  connects  the  node  Bl  to  the  node  ff. 

Graph :  Formally,  we  define  the  constructed  graph  with 
the  set  of  nodes  V  =  {B1}^  and  the  set  of  edges  (or  local 
controllers)  M  =  {pij}.  The  set  of  controllers  available  at 
Bl  is  denoted  by  M(  i)  (i.e.  the  set  of  edges  starting  from 
Bl).  Similar  to  PRM,  in  which  the  path  (final  solution)  is 
constructed  as  a  concatenation  of  edges  on  the  roadmap,  in 
FIRM,  the  policy  is  constructed  by  the  concatenation  of  the 
local  policies.  However,  it  is  worth  noting  that  with  this  con¬ 
struction  we  still  perform  planning  in  a  continuous  space 
and  do  not  discretize  the  control  space. 

Local  controllers  versus  macro -actions:  By  the  term 
‘macro-action’  we  mean  a  sequence  of  controls  (actions) 
(He  et  al.,  2010,  2011).  In  other  words,  a  macro-action  is  a 
sequence  of  open-loop  policies.  It  is  important  to  note  that  a 


local  controller  is  not  a  macro-action,  but  rather  a  sequence 
of  policies  (macro-policy),  each  of  which  is  a  mapping 
from  belief  space  to  the  continuous  control  space.  Using 
macro-actions  results  in  an  open-loop  policy,  which  cannot 
compensate  for  the  belief-state  deviation  from  the  planned 
path.  However,  under  local  controllers  (macro-policies),  the 
effect  of  noise  can  be  compensated  for,  due  to  the  feedback 
nature  of  the  controllers,  and  thus,  the  belief  can  be  steered 
towards  a  stopping  region. 

6.3.  Belief  SMDP 

In  this  section,  we  reduce  planning  over  the  entire  belief 
space  into  planning  over  a  subset  of  belief  space,  which  is 
actually  the  union  of  FIRM  graph  nodes;  that  is,  4>  =  u. jB>. 

SMDP  transition  costs:  First,  we  generalize  the  concept 
of  one-step  cost  c(b,u):  B  x  U  — >  M>o  to  the  one-step 
SMDP  cost  C^,ft):BxM^  M>o,  which  represents  the 
cost  of  invoking  the  local  controller  /z(-)  at  belief  state  b\  in 
other  words, 

r 

C\b,n)-.=  YJc{bt,n{bt)\bo  =  b)  (27) 
where  T  :=  T(^\b,  p). 

Belief  SMDP :  According  to  the  above  definitions,  the 
original  POMDP,  formulated  using  DP  in  Equation  (6),  can 
be  reduced  to  an  SMDP  (Sutton  et  al.,  1999)  in  the  belief 
space,  referred  to  as  a  belief  SMDP: 

J\  b)  =  min  C\  b,p)+[ p(  b'\b,  p)J\  b')  db\  Vb  e  B\  Vi 

(28) 

The  integration  over  the  entire  belief  space  in  Equation 
(6)  is  reduced  to  integration  over  the  sampled  nodes  (that 
is,  40  in  Equation  (28),  as  p  stops  executing. 


6.4.  FIRMMDP 

Graph  transitions:  The  DP  in  (28),  though  computationally 
more  tractable  than  the  original  POMDP,  is  defined  on  the 
continuous  neighborhoods  Bl  and  thus  is  still  formidable 
to  solve.  However,  for  sufficiently  small  Bl  and  sufficiently 
smooth  cost  functions,  the  cost-to-go  of  all  beliefs  in  Bl  are 
approximately  equal.  Thus,  we  can  define  the  graph-level 
transition  cost  and  probabilities  Cg  :  ¥  x  M  ->  M  and 
:  V  x  V  x  M  ->  [0, 1]  on  the  FIRM  graph,  in  other 
words,  over  the  finite  space  V,  such  that  p)  is  the 

transition  probability  from  Bl  to  B]  under  the  local  planner 
p.  Similarly,  C8(B\p)  denotes  the  cost  of  invoking  local 
planner  p  at  the  FIRM  node  Bl .  Accordingly,  J8  :  ¥  ->  M 
is  the  cost-to-go  function  over  the  FIRM  nodes.  These 
roadmap-level  quantities  are  defined  using  the  following 
‘piecewise  constant  approximation’,  which  is  an  arbitrar¬ 
ily  good  approximation  for  smooth  enough  functions  and 
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sufficiently  small  Bl : 


VbeB\Vi 


J\Vc)k  J\b) 

C8(B\/jl):=  C*(^,/x)^  Cs{b,/i) 

F*(  •  \B\  /x)  :=  P(  -| b{,  p) «  P(  •  \b,  /x) 

(29) 


where  blc  is  a  representative  point  in  Bl.  For  example,  if  Bl 
is  a  ball,  the  typical  value  for  blc  is  the  center  of  Bl .  This 
approximation  essentially  states  that  any  belief  in  the  region 
Bl  is  represented  by  blc  for  the  purpose  of  decision-making. 

Obstacle-free  FIRM  MDP:  Given  the  approximation  in 
Equation  (29),  the  DP  equation  in  (28)  becomes 

J8(  Bl)  =  J\  bh  =  min  Cs ( bl,  /x)  +  [p(  b' \bi,  /x)  Js(  b')  db' 

neM(i)  Jy 

=  min C\b[,ti)  +  Y  f p(b'\bi,n)Js(b')db' 

neM(i)  j 

^mmC8(B\/jL)  +  J"  f p(b'\bi9fjb)J8(ff)  dbf 

/*eM(i)  “  Jr i 

=  min  C8(B\  fi)  +  Y"  ¥(ff\bi ,  /x) 

/X6M(0 

j 

=  min  Cg(Bi,n)  +  Y'  Jg(  Bl)F8(BJ\Bi , /i) ,  Vi 

lieMii) 

(30) 


Accordingly,  we  can  get  the  graph  feedback  ng  :  Y  — >  M 
through  the  following  DP: 

JS(B‘)  =  minCg(Bi,n)  +  J'Fg(B'\Bi,n)Jg(Bi),  Vi 

/xgM(i')  J 

(31a) 

tt8(B1)  =  SLVgminC8(Bl,  /ji)-\-^F8(Bj\Bl,  /ji)J8(Bj) ,  V7 

/xgM(0  “ 

(31b) 

Thus,  the  original  POMDP  over  the  entire  belief  space 
becomes  a  finite  TV-state  MDP  in  Equation  (31)  defined  on 
the  finite  set  of  FIRM  nodes  V  =  {Bff^ .  We  call  the  MDP 
in  Equation  (31)  the  FIRM  MDP  in  the  absence  of  obsta¬ 
cles.  It  is  worth  noting  that  J8(-):  ¥  — >  M  is  the  cost-to-go 
function  over  the  FIRM  nodes,  which  assigns  a  cost-to-go 
for  every  FIRM  node  B\  and  the  mapping  Tt8f):  ¥  — >  M 
is  a  mapping  over  the  FIRM  graph  from  FIRM  nodes  into 
the  set  of  local  controllers  that  returns  the  optimal  local 
controller  that  has  to  be  taken  at  any  FIRM  node.  Given 
C8(B , /x)  for  all  ( B ,  /x)  pairs,  the  DP  in  Equation  (31)  can 
be  solved  offline  using  standard  techniques  such  as  the 
value/policy  iteration  to  yield  a  feedback  policy  n8  over 
FIRM  nodes  {Bf. 

6.5.  Incorporating  obstacles  into  FIRM  MDP 

In  the  presence  of  obstacles  (i.e.  state  or  control  con¬ 
straints),  we  may  not  assume  that  the  local  controller  /x^(-) 


can  drive  any  b  e  Bl  into  Bi  with  probability  one.  Instead, 
we  have  to  specify  the  failure  probabilities  that  the  robot 
collides  with  an  obstacle  (hits  the  failure  set  F). 

Let  us  generalize  the  transition  probabilities  by  defining 
F(F\b,  /x)  as  the  probability  of  hitting  failure  set  F  before 
hitting  stopping  region  ^  under  /x  starting  from  b.  Simi¬ 
larly,  we  generalize  F8  such  that  F8(  F\B\  /x)  :=  F(F\blc,  /x). 
Finally,  we  generalize  the  cost-to-go  function  by  adding  F 
to  its  input  set,  that  is,  J8  :  {¥,  F}  ->  M>0,  such  that  J8(F) 
is  a  user-defined  suitably  high  cost  for  hitting  obstacles. 
Note  that  the  cost-to-go  from  the  goal  node  is  zero,  that 
is,  Jg(Bgoal)=  0.  Therefore,  we  can  modify  Equation  (31) 
to  incorporate  constraints  by  repeating  the  procedure  in  the 
previous  subsection  to  get  the  FIRM  MDP  in  the  presence 
of  obstacles: 

J8(Bi)=  min  C8(Bi,[i)+J8(F)F8(F\B\[i) 

reM  (i) 

+  ^jJs(Bj)f,g(Bj\Bi,ii)  (32a) 

j 

Tt8(Bi)  =  SiYg  min  C8(B\ii)FJ8{F)F8(F\B\ii) 

+  y^(B0Pg(fi/|fli,At.)  (32b) 

j 

All  that  is  required  to  solve  the  above  DP  equation  is  the 
values  of  the  costs  C8(B\  /x)  and  the  transition  probability 
functions  F8(  -\Bl,  /x).  Thus,  the  main  difference  from  the 
obstacle-free  case  is  the  addition  of  a  ‘failure’  state  to  the 
FIRM  MDP  along  with  associated  probabilities  of  failure 
from  various  nodes  Bl. 

6.6.  Overall  policy  rt 

The  overall  feedback  tt  :  B  — >  U  is  generated  by  combining 
the  global  policy  n8  on  the  graph  and  local  policies  {/x^}. 
Suppose  at  the  kth  time  step  the  active  local  controller  is 
shown  by  /x£.  It  remains  unchanged  (/x£+1  =  /xp  and  keeps 
generating  control  signals  based  on  the  belief  bk  at  each 
time  step,  until  the  belief  reaches  the  corresponding  stop¬ 
ping  region,  TL  Once  the  belief  enters  the  stopping  region 
=  u \jff,  it  is  in  a  graph  node,  say  Bl  e  ¥.  Accordingly, 
the  global  policy  Tt8  chooses  the  next  local  controller,  that 
is,  nl+i  =  7Tg(#p.  Thus,  this  hybrid  policy  is  stated  as 
follows: 


1  H*k(bk),  =  if  hi* 

(33) 

Initial  controller.  Given  the  initial  belief  bo ,  if  bo  is  in  one 
of  the  graph  nodes,  then  we  just  choose  the  best  local  con¬ 
troller  using  n8.  However,  if  bo  does  not  belong  to  any  of 
the  graph  nodes,  we  first  make  a  singleton  set  B°  =  {bo} 
and  connect  it  to  the  graph  nodes  based  on  the  connect 
methods  discussed  in  Section  6.2.  Denoting  the  outgoing 
edges  (local  controllers)  from  B°  by  M(0),  we  compute 
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the  transition  cost  Cg(B°,n),  the  transition  probabilities 
P9(6J  |6°,/u,)  for  all  j,  and  failure  probability  P(F  |B°,m)  for 
invoking  local  controllers  /x  e  M(0)  at  8°.  Then,  we  choose 
the  best  initial  controller  n*0  as 


* 

Mo  = 


arg  min  {C9(B°,/i)+p9(F  \B°,ix)J  9(F) 

fJLE  M(0) 

+  EP9(eiie°,M)i9(e/)},  if  b0  ^  ^  04) 

i 

jtg(Br),  if  3r,s.t.  b0  e  Br 


It  is  worth  noting  that  computing  is  the  only  part  of  the 
computation  that  depends  on  the  initial  belief  bo  and  that 
has  to  be  performed  online;  in  other  words,  if  a  large  devia¬ 
tion  occurs,  /!q  is  the  only  part  that  needs  to  be  reproduced 
for  the  new  initial  point.  After  ^  drives  the  system  to  a 
graph  node,  from  there  on  the  optimal  policy  is  already 
known.  Computing  /ig  is  feasible  in  real  time  as  M(0) 
contains  a  limited  number  of  edges. 


6.7.  Success  probability 

We  would  also  like  to  quantify  the  quality  of  the  solution 
jr  in  the  presence  of  obstacles.  To  this  end,  we  require  the 
probability  of  success  of  the  policy  ng  at  the  higher-level 
Markov  chain  on  FIRM  nodes  given  by  Equation  (32b). 
Without  loss  of  generality  let  us  assume  that  the  first  node 
B1  is  the  goal  node  Bgoal.  The  DP  in  Equation  (32)  has 
N  +  1  states  {F  ,Bgoal,B2, . . .  ,BN]  that  can  be  decomposed 
into  three  disjoint  classes:  the  failure  class  {F },  the  goal 
cl  ass  {8 goal },  and  the  transient  class  (B2,  S3, . . . ,  Bw+1).  The 
goal  and  failure  classes  are  absorbing  recurrent  classes  of 
this  Markov  chain.  As  a  result,  the  transition  probability 
matrix  of  this  higher-level  N  +  1  state  Markov  chain  can 
be  decomposed  as  follows  (Norris,  1997): 


V  = 


Vf  0  O' 

0  'Pgoal  0 

IZf  TZqoal  Q 


(35) 


where  Vgoai  =  Fg(B1\B\-)  =  land  Vf  =p9(F|F,-)=  1, 
since  goal  and  failure  classes  are  the  absorbing  recur¬ 
rent  classes;  in  other  words,  the  system  stops  once 
it  reaches  the  goal  or  it  fails.  Q  is  a  matrix  that 
represents  the  transition  probabilities  between  transient 
nodes  in  the  transient  class,  whose  (/,/)th  element  is 
Q[i,j]  =  W’g(Bi+1\Bi+1,7Tg(Bi+1)).  Vectors  7 Zgoai  and 
lit  are  (A/  -  1)  xl  vectors  that  represent  the  probabil¬ 
ity  of  transient  nodes  Y  \  Bgoal  getting  absorbed  into  the 
goal  and  failure  node,  respectively;  that  is,  llgoailj]  = 
Fg(B1\Bi+1,jTg[Bi+1))an(i'Rf[j]  =  Fg(F\Bi+1,jTg(Bi+1)). 
Then,  it  can  be  shown  that  the  success  probability  from  any 
desired  node  8'  e  V\Bgoal  is  as  follows  (Norris,  1997): 


P(  success|8\  jtg)  :=  P(  Bgoal |8\  ;r9) 

=  r,L1(/  -Qr'Kgoai,  Vi  >2  (36) 


where  r,-  is  a  column  vector  with  all  elements  equal  to  zero 
except  the/th  element,  which  issetto  one.  Note  that  the  vec¬ 
tor  Vs  =  (I  -  Qi^Ugoai  includes  the  success  probability 
from  every  graph  node. 

In  the  next  section,  we  will  discuss  the  success  prob¬ 
ability  in  more  detail  in  the  context  of  probabilis¬ 
tic  completeness.  However,  according  to  the  computed 
P(  success|B',7r9),  one  can  compute  the  success  probability 
from  any  given  initial  belief  bo  as 

P(  success|bo,  tt)  —  T>(  By  |V0,  Mo)  P(  success|By,  ng) 

(37) 

where  n*0  is  given  by  Equation  (34).  Then,  this  suc¬ 
cess  probability  is  compared  with  a  minimum  accept¬ 
able  success  probability,  denoted  by  pmin.  If  the  condition 
P(  success|Vo,7r)  >  Pmin  is  not  satisfied,  then  the  number  of 
nodes  in  the  graph  has  to  be  increased  until  the  condition 
is  satisfied.  If,  from  the  initial  point  b0,  a  successful  policy 
in  the  class  of  admissible  policies  exists,  then  this  proce¬ 
dure  will  eventually  find  a  successful  policy  by  increasing 
the  number  of  nodes,  due  to  the  probabilistic  completeness 
of  the  method,  which  is  discussed  in  Section  7.1. 

6.8.  Generic  FIRM  algorithms 

The  generic  algorithms  for  the  offline  construction  of  FIRM 
and  onl  i  ne  pi anni  ng  w i th  F I R  M  are  presented  i  n  A I gori thms 
3  and  4,  respectively.  Concrete  instantiations  of  these  algo¬ 
rithms  for  SLQG-FIRM  are  given  in  Algorithms  1  and  2, 
respectively. 


Algorithm  3:  Generic  construction  of  the  FIRM  graph 

(offline) 

1  Sample  a  set  of  stabilizer  parameters  V  =  {v1}  and 
construct  stabilizers  M  =  {ii1}  accordingly; 

2  Sample  a  set  of  belief  nodes  V  =  (8'}  such  that  they 
satisfy  the  reachability  condition; 

3  Connect  the  belief  nodes  using  local  controllers 

4  For  each  8'  and  m  e  M(  /),  compute  the  transition  cost 
C9(B',/x)  and  transition  probabilities  Vg(Bi\B',ii)  and 
P9(F  |B',/i)  associated  with  invoking  /i  at  8'; 

5  Solve  the  graph  DP  in  Equation  (32)  to  compute 
feedback  ng  over  graph  nodes,  and  compute  n 
accordingly; 


Single-query  versus  multi-query:  As  mentioned  earlier, 
most  approaches  for  planning  in  belief  space  in  continu¬ 
ous  state,  action,  and  observation  spaces  result  in  query- 
dependent  plans.  However,  one  of  the  contributions  of 
FIRM  is  that  its  construction  does  not  depend  on  the  query. 
In  Algorithms  3  and  4,  it  is  assumed  that  the  goal  is  fixed 
for  all  queries;  in  this  case  in  the  planning  phase  we  are  only 
robust  to  changes  in  the  starting  point  of  the  query.  How¬ 
ever,  to  make  the  algorithms  also  robust  to  changes  in  the 
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Algorithm  4:  Generic  planning  (or  replanning)  on 
FIRM  (online) 

1  Given  an  initial  belief  bo,  invoke  the  controller  /xo(-)  in 
Equation  (34)  to  take  the  robot  into  some  FIRM  node  B  \ 

2  while  B  ^  Bgoal  do 

3  Given  the  system  is  in  FIRM  node  B ,  invoke  the 
global  feedback  policy  tv  8  to  choose  the  local 
feedback  policy  /x(-)  =  n8(B); 

4  Let  the  local  controller  /x(-)  execute  until  the  robot 
is  absorbed  into  a  FIRM  node  B'  or  until  it  hits  the 
failure  set; 

5  if  Collision  happens  then  return  Collision; 

6  Update  current  node  B  <—  B'  \ 


goal  belief,  one  can  just  move  the  last  line  of  Algorithm  3  to 
the  first  line  of  Algorithm  4.  Note  that  the  computationally 
expensive  part  of  Algorithm  3  is  the  computation  of  edge 
costs,  which  is  independent  of  the  start  and  goal  locations 
of  the  submitted  query. 

6.9.  Discussion 

In  summary,  in  FIRM  we  aim  to  transform  the  original 
POMDP  problem  into  a  belief  SMDP  problem  and  solve 
it  on  a  subset  of  belief  space.  Given  the  smoothness  of  the 
cost  function  and  transition  probabilities,  the  solution  of  the 
FIRM  MDP  is  arbitrarily  close  to  the  solution  of  the  belief 
SMDP  over  FIRM  nodes.  The  important  characteristic  of 
FIRM  is  that  it  is  solved  offline  and  thus  performing  the 
online  phase  of  planning  (or  replanning)  is  computationally 
feasible  in  real  time.  To  exploit  the  generic  FIRM  frame¬ 
work,  one  has  to  find  ( B ,  /x)  pairs,  where  B  is  reachable  (or 
aT-reachable)  under  /x,  as  FIRM  nodes  and  edges.  Also, 
transition  costs  and  probabilities  need  to  be  computed. 
Finally,  the  corresponding  FIRM  MDP  needs  to  be  solved, 
which  provides  a  global  feedback  policy  on  the  graph  that 
can  be  used  in  planning,  as  detailed  in  Algorithm  4.  SLQG- 
FIRM,  presented  in  Section  5,  is  an  instance  of  FIRM,  in 
which  the  design  of  local  controllers  t$  and  FIRM  nodes 
Bl  is  based  on  the  properties  of  SLQG  controllers. 

7.  Probabilistic  completeness  under 
uncertainty 

In  this  section,  we  extend  the  concept  of  probabilistic  com¬ 
pleteness  of  planning  algorithms  for  deterministic  systems 
to  the  concept  of  probabilistic  completeness  of  planning 
algorithms  under  uncertainty,  based  on  Agha-mohammadi 
et  al.  (2012b).  Accordingly,  in  the  next  subsection,  we 
discuss  the  probabilistic  completeness  of  the  FIRM-based 
algorithms.  We  start  by  reviewing  the  definition  of  success 
and  probabilistic  completeness  in  the  deterministic  case, 
and  then  we  extend  these  definitions  to  the  stochastic  case. 

Success  in  the  deterministic  case :  In  the  deterministic 
case,  such  as  conventional  PRM,  the  outcome  of  the  plan¬ 
ning  algorithm  is  a  path.  Thus,  success  is  defined  for  paths: 


for  a  given  initial  and  goal  point,  a  successful  path  is  a 
path  connecting  the  start  point  to  the  goal  point  which  lies 
entirely  in  the  obstacle-free  space. 

Probabilistic  completeness  in  the  deterministic  case : 
In  the  absence  of  uncertainty,  a  sampling-based  motion¬ 
planning  algorithm  is  probabilistically  complete  if  by 
increasing  the  number  of  samples,  the  probability  of  finding 
a  successful  path,  if  one  exists,  asymptotically  approaches 
one. 

A  difference  between  the  deterministic  and  the  proba¬ 
bilistic  case :  In  the  presence  of  uncertainty,  success  cannot 
be  defined  for  a  path  and  has  to  be  defined  for  a  pol¬ 
icy.  Indeed,  on  a  given  path,  different  policies  may  result 
in  different  success  probabilities.  Moreover,  under  uncer¬ 
tainty,  one  can  only  assign  a  probability  to  reaching  the 
goal.  Thus,  to  define  success  for  a  policy  we  consider  a 
threshold  pmin  e  [0, 1]  and  decide  about  success  or  failure 
accordingly. 

Successful  policy :  In  the  presence  of  uncertainty,  the 
solution  of  the  planning  algorithm  is  a  function,  called 
a  closed-loop  policy  or  feedback.  Therefore,  success  is 
defined  for  policies:  for  a  given  initial  belief  bo  and  goal 
region  B§oal ,  a  successful  policy  is  a  policy  under  which  the 
probability  of  reaching  the  goal  from  the  given  initial  point 
is  greater  than  some  predefined  threshold  pmin.  In  other 
words,  7T  is  successful  for  a  given  bo  if  P(  success \bo,  n)\= 
f>(Bgoal\bo,7t)>  pmin. 

Policy  in  sampling-based  methods:  In  sampling-based 
methods,  a  policy  is  parametrized  by  a  set  of  samples.  These 
samples  can  be  in  the  state  or  belief  space,  based  on  the 
algorithm.  Let  us  denote  these  samples  in  a  generic  space 
by  {yu  Y2,  •  •  • ,  Yn}-  Thus,  we  can  highlight  the  dependency 
of  the  sampling-based  policy  on  the  samples  by  the  notation 
7r(  •;  {yi,  y2, . . . ,  Xv}).  The  number  of  samples  is  denoted 
by  A. 

Strong  probabilistic  completeness  under  uncertainty 
( SPCUU ):  Suppose  there  exists  a  successful  policy  ft.  Then 
a  sampling-based  motion-planning  algorithm  is  SPCUU  if 
increasing  the  number  of  samples  without  bound  causes 
the  probability  of  finding  a  successful  policy  to  approach 
one.  In  other  words,  if  there  exists  a  successful  policy  re, 
then  we  have  the  following  property  for  the  sampling-based 
policy  7T : 

lim  nBgoal\b0,7t)>pmin  (38) 

N^OO 

where  N  is  the  number  of  samples  in  the  sampling-based 
method. 

Achieving  an  algorithm  that  is  SPCUU  requires  search¬ 
ing  in  the  entire  space  of  policies,  which  is  a  computa¬ 
tionally  intractable  task.  Usually  in  solving  POMDPs  the 
space  of  admissible  policies  is  restricted  to  a  sufficiently 
rich  subset  of  policy  space,  denoted  by  II,  within  which  the 
method  searches  for  the  best  policy.  Restricting  the  success¬ 
ful  policy  to  the  set  II,  we  define  below  a  weaker  notion  of 
PCUU. 
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PCUU :  Suppose  that  there  exists  a  successful  policy 
ft  £  n.  Then,  a  sampling-based  motion-planning  algorithm 
is  PCUU  if,  when  increasing  the  number  of  samples  with¬ 
out  bound,  the  probability  of  finding  a  successful  policy 
approaches  one.  In  other  words,  if  there  exists  a  success¬ 
ful  policy  ft  e  n,  then  for  the  sampling-based  policy  tv,  we 
have  lim  ¥(Bs°al\b0,n)>  pmin. 

As  discussed  in  Section  6,  in  FIRM,  inspired  by  the 
sampling-based  PRM  framework,  this  reduction  from  the 
entire  function  space  to  the  restricted  set  of  policies  n  is 
performed  by  sampling  feedback  local  planners  and  con¬ 
catenating  them.  Therefore,  the  structure  of  local  planners 
defines  the  set  TL  Each  local  planner  p}l  is  parametrized 
by  its  corresponding  parameter  \f  However,  as  mentioned 
in  Section  6.2,  we  can  consider  the  set  V  =  {v*}  as  the 
set  of  underlying  PRM  nodes.  Thus,  any  policy  tt  e  Yl 
is  parametrized  by  the  set  of  underlying  PRM  nodes  V  = 
{v*}^.  We  highlight  this  dependency  explicitly  through  the 
notation  n(  •;  V).  Therefore,  the  PCUU  condition  for  FIRM 
can  be  written  more  explicitly  as 

lim  P(  Bgoal  |  ,  7T ( • ;  V))  >  pmi„  (39) 

N^O O 

For  a  concrete  instantiation  of  FIRM,  we  can  explicitly 
characterize  the  set  TL  For  example,  in  SLQG-FIRM,  n 
is  the  set  of  all  possible  policies  that  can  be  generated  by 
concatenating  LQG  controllers. 


also.  The  stopping  regions  {if}  in  the  belief  space  and  the 
stopping  region  F  in  the  state  space  can  both  be  extended 
to  the  h- state  space,  respectively  denoted  by  {tf}  and  T, 
where  &  C  X/*  and  T  C  X/*  are  defined  as 

&  :=  {(X,  b)  \X  e  X/ree,  b  e  ff]  (40) 

F:={(X,b)\X  eF,be  B}  (41) 

Sj  :=  &  U  T,  s'  :=  X*  \  Sj  (42) 

where  S’  and  S1 ,  respectively,  denote  the  entire  stopping 
region  and  transient  region  under  the  local  controller  iff 
Absorption  probability  of  local  planners :  If,  under  the 
dynamics  induced  by  the  local  planner,  the  system  reaches 
the  target  node  If,  the  local  planner  is  considered  to  be  suc¬ 
cessful;  if  the  system  hits  an  obstacle,  the  local  planner  is 
considered  to  have  failed.  The  success  probability  of  local 
planners  (i.e.  the  absorption  probability  into  FIRM  nodes) 
is  computed  by  solving  the  following  integral  equation  that 
results  from  the  law  of  total  probability: 

F(B\X,py)  =  f  p^\x'\X)F(B\X',py)dX' 

Jxh 

=  J  j/\X' \X)  dX' 

BJ 

+  J p^{X'\X)¥(B\X',^)  dX'  (43) 
st 


7.1.  Probabilistic  completeness  of  FIRM 

Obviously,  FIRM-based  methods  are  not  SPCUU  algo¬ 
rithms.  However,  in  this  section,  we  show  that  under  mild 
practical  conditions,  FIRM-based  methods  are  PCUU  algo¬ 
rithms.  We  first  provide  an  analysis  of  the  local  plan¬ 
ners  in  belief  space,  and  then  state  the  assumptions  more 
rigorously. 

Notation:  The  norm  ||  •  ||  is  the  supremum  norm  when  it  is 
applied  to  functions.  The  norm  ||  •  \\op  is  applied  on  operators 
and  it  stands  for  the  operator  norm  (Keener,  2000).  It  is 
worth  noting  that  in  this  section,  by  the  word  ‘continuous’, 
we  mean  ‘Lipschitz  continuous’.  Finally,  we  assume  that 
Xfree  is  a  compact  set. 

Hyper-state :  X  =  (x,b)e  X/7  is  referred  to  as  hyper- state 
(or  h- state),  which  is  a  state-belief  pair.  The  space  of  all  h- 
states  is  called  hyper-state  space  (h-state  space)  X/*  =  X  x 
B.  Further,  p^(X'\X)  denotes  the  one-step  transition  pdf 
induced  by  the  local  controller,  /x,  over  the  h-state  space. 
Also,  let  ¥n(S \X,  /x)  denote  the  transition  probability  from 
h-state  X  into  the  set  S  C  X/*  in  at  most  n  steps. 

Local  planner  and  extended  stopping  region:  The  role  of 
the  ( z,/)th  local  planner  or  local  controller  is  to  drive  the 
belief  from  the  region  Bl  to  its  stopping  region  if  in  the 
belief  space  (for  ease  of  notation,  we  ignore  the  case  where 
the  controller  can  stop  in  any  FIRM  node,  and  we  restrict 
its  stopping  region  to  if).  In  the  presence  of  obstacles,  we 
extend  the  concept  of  stopping  region  to  include  obstacles 


where  the  second  equality  in  Equation  (43)  follows  from 
substituting  the  following  conditions,  inherited  from  FIRM 
construction,  into  the  first  integral: 


X,p!j)  = 


ifX  e  If 
ifX  e  T 


(44) 


Henceforth,  we  drop  indices  i  and  j  to  unclutter  expres¬ 
sions.  Thus,  we  can  write 

F(B\X,p)  =  jp»(X'\X)  dX' 

B 

+  Jp»(X'\X)F(B\X',p)  dX' 

S 

=  R(X)+Ts\P(B\;pL)UX)  (45) 


where  the  operator  and  the  function  R(  X)  are 
defined  as 

T5  IfO) ](X)  :=  j p»(X'\X)f(X')dX', 

S 

R(X):=  J p^{X'\X)  dX'  (46) 

B 


The  solution  of  the  integral  equation  in  Equation  (45) 
is  expressed  in  the  following  as  a  Liouville-Neumann 
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series  (Keener,  2000),  similar  to  the  solution  of  the  inho¬ 
mogeneous  Fredholm  equation  of  second  type  (Keener, 
2000): 

oo 

P(£|*,M)=y>^[fl(-)](A’)  (47) 

n—  1 

We  show  that  the  series  in  Equation  (47)  is  a  convergent 
series  by  resorting  to  the  following  assumption,  which  is  a 
weaker  version  of  the  aforementioned  FIRM  condition  on 
the  design  of  nodes  and  local  controllers. 

Assumption  1.  We  assume  that  there  exists  some  time  step 
N  at  which  the  controller  stops  with  a  positive  probability. 
Mathematically,  there  exists  an  N  <  oo  and  a  f  >  0  such 
thatFN(Sj\X,  piij)>  f>  >  0,  for  all  X. 

This  assumption  is  almost  always  true,  as  it  rephrases  the 
role  of  the  controller  in  driving  the  system  toward  the  target 
region.  For  example,  if  we  have  Gaussian  noise  (as  is  the 
case  in  the  SLQG-FIRM),  the  assumption  is  true  at  N  =  1 
regardless  of  the  utilized  controller. 

Lemma  4.  Given  Assumption  1,  we  have 

'||T£IU,<1,  n<N 

■  IWsWop  <  l  -  P  <  h  n  >  N  (48) 

En0°=ollT"5L,<c<oo 

Proof.  See  Appendix  E.  □ 

Corollary  1.  The  series  ^  a  convergent  series, 

and  therefore  we  can  define  the  resolvent  operator  (/  — 
Ts)-1  [R]  =  £~0T”  [fl],  where  ||(7  -  T^)-1  \\op  <  c  < 

OO. 

Proof.  See  Appendix  F.  □ 

According  to  Corollary  1,  the  success  probability  of  the 
local  controller  pi  can  be  written,  using  the  defined  resolvent 
operator,  as 

P(S|^,m)=(/-T5)-1  [/?(•)  ](*)  (49) 

As  the  first  result  of  this  section  (Proposition  1),  we  aim 
to  show  that  this  absorption  probability  varies  continuously 
with  respect  to  changes  in  parameters  of  the  local  planner. 
However,  we  will  first  state  two  assumptions. 

Assumption  2.  We  assume  the  local  planning  law  and 
induced  transition  probabilities  are  smooth;  in  other  words, 
we  have  assume  the  following. 

•  Local  control  laws  are  continuous  in  their  parame¬ 
ters,  that  is,  for  the  ( i,j)th  local  controller,  mapping 
piij(  •;  v7) :  B  U  is  a  continuous  function  in  its 
parameter  v7. 

•  The  transition  pdf  on  h-state,  that  is,  p(X'\X,u), 
is  a  continuous  function  of  the  control  u;  in 
other  words,  there  exists  a  c\  <  oo  such  that 
|| p(  X'\X,  u)  —p(  X'\X ,  u)  ||  <  c 1 1| zi  —  u\\. 


Finally,  we  state  the  following  assumption,  in  which  we 
emphasize  the  fact  that,  as  v  — >  v,  the  transition  probability 
induced  by  the  local  controller  /z(  •;  v)  into  the  sets  B  and  B 
also  has  to  converge,  which  is  a  reasonable  assumption  for 
a  smooth  control  law. 

Assumption  3.  Consider  the  controllers  /r(-;v)  and 
/x(-;v),  whose  corresponding  extended  absorption  regions 
are  denoted  by  B  and  B,  respectively.  We  assume  that  there 
exist  real  numbers  r  >  0  and  d  <  oo  such  that  for 
|| v  —  v||  <  r,  we  have 

\\F\(B  Q  B\X,fi)\\  <  c'||v  —  v||  (50) 

where  ©  is  the  symmetric  difference  operator;  in  other 
words,  B  ©  B  =  ( B  \  B)  U(  B  \  B). 

Now  we  state  the  following  proposition  on  the  continuity 
of  the  success  probability  of  local  planners. 

Proposition  1  (Continuity  of  absorption  probabilities). 
Given  Assumptions  1,  2,  and  3,  the  absorption  probability 
¥(ff\b,  pfi)  is  continuous  in  parameter  v7  for  all  ij ,  and  b. 

Proof.  See  Appendix  G.  □ 

Now  we  present  the  main  result  regarding  the  probabilis¬ 
tic  completeness  of  FIRM-based  methods. 

Theorem  1.  Given  Assumptions  1,  2,  and  3,  any  planning 
algorithm  under  uncertainty  that  is  generated  based  on  the 
FIRM  framework  (i.e.  guarantees  belief  node  reachability 
and  induces  a  roadmap  in  the  belief  space  with  independent 
edge  costs )  is  PCUU. 

Proof.  See  Appendix  H.  □ 

The  basic  idea  of  probabilistic  completeness  under  uncer¬ 
tainty  stems  from  an  idea  similar  to  the  one  in  the 
path-isolation-based  analysis  for  planners  in  deterministic 
systems.  Roughly  speaking,  in  the  path  isolation  argument 
for  sampling-based  planners  in  the  absence  of  uncertainty, 
if  there  is  a  successful  path  and  a  non-zero  neighborhood  of 
this  path,  in  which  every  path  is  successful,  we  can  eventu¬ 
ally  find  a  path  in  this  neighborhood  by  increasing  the  num¬ 
ber  of  samples  unboundedly.  Similarly,  in  the  presence  of 
uncertainty,  if  there  is  a  successful  policy,  it  is  parametrized 
by  some  parameters  (set  of  PRM  nodes,  in  FIRM).  Thus, 
if  there  exists  a  non-zero  measure  neighborhood  of  these 
parameters,  within  which  selected  parameters  lead  to  a  suc¬ 
cessful  policy,  we  can  eventually  reach  a  successful  pol¬ 
icy  by  increasing  the  number  of  samples  unboundedly  and 
choosing  samples  in  the  target  neighborhoods. 

8.  Experimental  results 

In  this  section,  we  first  illustrate  theoretical  results  from  the 
previous  sections  on  a  planar  robot  in  a  small  3D  planning 
domain.  Then,  we  present  planning  results  on  a  larger  3D 
state  space.  Finally,  we  report  the  results  of  the  method  on 
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a  dynamical  model  of  an  eight-arm  manipulator  (sixteen- 
degree-of-freedom  state  space).  This  section  is  followed  by 
a  brief  comparison  with  other  state-of-the-art  methods  in 
Section  9. 

8.1.  Planar  3D  omnidirectional  robot:  Illustrat¬ 
ing  steps  in  construction  and  planning  with 
SLOG-FIRM 

In  this  subsection,  we  focus  on  an  omni-directional  robot. 
Its  state  is  composed  of  its  2D  position  in  the  plane  and  its 
heading  angle.  The  goal  in  this  section  is  to  illustrate  the 
steps  of  constructing  SLQG-FIRM  and  planning  with  it. 

8.1.1.  Motion  model.  A  three-wheel  omnidirectional 
mobile  robot  is  used  in  experiments  with  the  nonlinear 
kinematic  model  given  in  Kalmar-Nagy  et  al.  (2004).  The 
state  vector  is  composed  of  a  2D  location  and  heading 
angle  x  =  [1x,2x,0]T  in  a  global  world  frame.  Further, 
u  =  [1u,2u,3u]T  is  the  vector  of  controls,  where  'u  is  the 
linear  velocity  of  the  /th  wheel.  The  motion  model  for  this 
robot,  in  its  original  continuous  form,  is  (Kalmar-Nagy 
et  al.,  2004) 

x  =  fc(x,  u,  w)  =  T(x)  u  +  w  (51) 

where  w  is  the  motion  noise,  which  is  drawn  from  a  zero- 
mean  Gaussian  distribution,  and 

/— §  sin(  6>)  -|sin(f-0) 

T(x)=l  | cos( O)  -fcos^f-60 

V  3 r  3r 

where  r  is  the  distance  of  the  wheels  from  the  robot's  center 
of  mass.  The  discrete  motion  model  is  shown  by 

x*  =  Hx*_i,u*_i,wrk_i)  (53) 

where  ~  A f(  0,  Q)  is  the  motion  noise  at  the  kth  time 
step,  which  is  drawn  from  a  zero-mean  Gaussian  distribu¬ 
tion  with  covariance  matrix  Q.  It  can  be  shown  that  if  we 
linearize  this  system,  the  linearized  motion  model  satisfies 
the  controllability  condition  in  Property  1. 

8.1.2.  Observation  model.  In  experiments,  the  robot  is 
equipped  with  exteroceptive  sensors  that  provide  range  and 
bearing  measurements  from  existing  landmarks  (radio  bea¬ 
cons)  in  the  environment.  The  2D  location  of  the /th  land¬ 
mark  is  denoted  by  Lj.  Measuring  Lj  can  be  modeled  as 
follows: 

iz  =  ih(x,iv) 

=  [||Jd||,  atan2(yd2M)  -  0]T  +  'v,  'v  ~  JV(Q'R) 

(54) 

where yd  =  [ydi,yd2]T  :=  [1xf  2x]T  -  Lj.  The  vector  >v  is  a 
state-dependent  observation  noise,  with  covariance 

R  =  diag(  ( ^Il'dU  +  orb)2 ,  ( d||  +  aeb )2 )  (55) 


In  other  words,  the  uncertainty  (standard  deviation)  of  the 
sensor  reading  increases  as  the  robot  gets  farther  from  the 
landmarks;  rjr  =  rj9  =  0.3  determines  this  dependence, 
and  ub  =  0.01  m  and  =  0.5°  are  the  bias  standard 
deviations.  A  similar  model  for  range  sensing  is  used  in 
Prentice  and  Roy  (2009).  We  assume  the  robot  observes  all 
Nl  landmarks  at  all  times  and  their  observation  noises  are 
independent.  Thus,  the  total  measurement  vector  is  denoted 
by  z  =  [1zT,2zT, . . .  ,Nlzj]j ,  and,  due  to  the  independence 
of  measurements  of  different  landmarks,  the  observation 
model  for  all  landmarks  can  be  written  as 

z  =  h(x)+v,  v  —  AT( O,  R),  R  =  diag^R, ...,NlR) 

(56) 

It  is  straightforward  to  show  that  the  linearized  version 
of  this  observation  model  satisfies  the  observability  con¬ 
dition  in  Property  1.  Therefore,  this  entire  system  model 
(motion  and  sensing  models)  satisfies  Property  1  and  thus 
the  SLQG-FIRM  can  be  used  for  planning. 


8.1.3.  Construction  of  SLQG-FIRM  nodes  and  edges.  Fig¬ 
ure  4(a)  shows  a  sample  environment,  including  obsta¬ 
cles,  landmarks,  and  enumerated  nodes  in  ( 1x,2x,0)  space. 
Nodes  are  shown  by  blue  triangles,  which  encode  the  posi¬ 
tion  ( 1x,2x)  and  heading  angle  6  of  the  robot.  Landmarks 
are  shown  by  black  stars.  The  corresponding  FIRM  nodes 
are  computed  and  shown  in  Figure  4(b).  All  elements  in  Fig¬ 
ure  4(b)  are  defined  in  ( 1x,  2x,  0)  space  but  only  the  (Jx,  2x) 
portion  of  them  is  shown.  Each  =  ( s/,P's)  is  illustrated 
by  a  red  dot  representing  \i ,  and  a  green  ellipse  represent¬ 
ing  the  3<t  ellipse  of  covariance  P's.  Each  FIRM  node  By  is 
a  neighborhood  around  b(.  In  the  experiments,  we  define 
the  node  region  using  the  component- wise  version  of  Equa¬ 
tion  (18)  to  handle  the  error  scale  difference  in  position  and 
orientation  variables: 

B'  =  [b  =  (x,P)  \  \x-\J\  <e,\P  -P‘s\  <  A}  (57) 

where  |  •  |  and  <  stand  for  the  absol  ute  val  ue  and  component- 
wise  comparison  operators,  respectively.  We  also  set  e  = 
[0.07  m,0.07  m,  1°]T  and  A  =  eeT  to  quantify  BT  The 
projection  of  By  onto  the  space  of  estimation  mean,  that  is, 
B'x  =  (x+  :  fx+  -  V|  <  e},  is  a  neighborhood  around  V, 
which  is  shown  by  a  cyan  rectangle  centered  at  V.  Projec¬ 
tion  of  B 1  onto  the  space  of  estimation  covariances,  that  is, 
Bj,  =  {P  :  |P  -  P 4 1  <  A},  is  a  neighborhood  around 
P{.  However,  in  a  2D  plot  B'p  cannot  be  shown  due  to 
its  high  dimension.  Thus,  we  partially  illustrate  it  only  by 
two  dashed  green  ellipses  that  represent  3ct  covariances  of 
P's  -  A d  and  Pi  +  A d,  where  A<j  is  the  matrix  A,  whose  off- 
diagonal  elements  are  set  to  zero.  For  illustration  purposes, 
both  of  these  neighborhoods  (i.e.  BJX  and  B'p)  are  magnified 
5x  in  Figure  4(b). 

8.1.4.  Transition  costs  and  probabilities.  After  designing 
FIRM  nodes  and  local  controllers,  the  transition  costs  and 


|  sin(  f  +  o) 
— §  cos(  f  +0) 
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Fig.  4.  (a)  The  underlying  PRM  graph.  Gray  polygons  are  the  obstacles  and  black  stars  represent  the  landmarks’  locations,  (b)  FIRM 
nodes  corresponding  to  PRM  nodes. 


probabilities  need  to  be  computed.  Based  on  the  given  task 
and  needed  accuracy,  different  approaches  can  be  taken. 
Here,  we  use  a  particle-based  approximation  of  the  distri¬ 
bution  to  compute  these  quantities,  where  we  use  M  —  100 
particles.  In  other  words,  for  every  ( B ,  \i)  pair,  we  perform 
100  runs.  At  every  run,  a  sample  path  of  state  x,  a  sample 
path  of  estimation  mean  v+,  and  a  sample  path  of  estimation 
covariance  P  is  generated.  If  the  filter  of  choice  in  the  edge 
controller  is  the  linearized  Kalman  filter  (LKF)  (Crassidis 
and  Junkins,  2004;  Simon,  2006),  the  covariance  evolution 
is  deterministic  and  there  is  no  need  to  generate  100  differ¬ 
ent  sample  covariance  paths.  However,  if  the  filter  of  choice 
in  the  edge  controller  is  the  EKF  (Crassidis  and  Junkins, 
2004;  Simon,  2006),  then  we  need  to  generate  the  sample 
covariance  paths  too,  to  take  into  account  the  stochasticity 
of  the  covariance  matrix.  Figure  5(a)  depicts  sample  paths 
of  the  true  state  v  and  estimation  mean  in  green  and 
dark  red,  respectively,  for  M  =  100  particles.  Note  that 
when  a  true  state  path  (green  path)  collides  with  an  obstacle, 
the  process  stops  and  failure  happens.  However,  in  this  fig¬ 
ure,  for  illustration  purposes,  we  continue  the  process  and 
ignore  the  obstacles  to  better  show  the  uncertainty  tube  and 
information  availability  in  different  parts  of  the  space.  As 
seen  in  Figure  5(a),  the  behavior  of  the  true  state  on  the 
edges  which  have  access  to  more  accurate  observations  is 
remarkably  close  to  the  planned  behavior.  In  contrast,  on  the 
edges  that  get  less  informative  observations,  the  controller 
cannot  effectively  compensate  for  deviations  of  the  ground 
truth  from  the  nominal  path,  which  can  lead  to  collision 
with  obstacles. 

To  avoid  clutter,  Figure  5(b)  the  depicts  sample  esti¬ 
mation  covariance  evolution  only  for  a  single  particle.  In 
this  figure,  we  let  the  process  and  observation  noise  be 
zero,  to  keep  the  centers  of  the  ellipses  (i.e.  estimation 
means)  on  the  planned  points.  However,  note  that  in  gen¬ 
eral  the  estimation  mean  is  affected  by  the  noise  (as  is 


seen  in  Figure  5(a)).  Indeed,  Figure  5(b)  can  be  seen  as  the 
maximum-likelihood  estimation  uncertainty  tube  over  the 
roadmap. 

Fet  us  denote  the  qth  sample  path  for  the  true  state  by 
Xq fTq ,  for  the  estimation  mean  by  ,  and  for  the  esti¬ 
mation  covariance  by  P^Tq ,  where  Tq  is  the  stopping  time 
of  the  qth  particle  in  executing  /z  at  B.  Moreover,  one  can 
assign  a  weight  to  each  particle  q  based  on  the  probabil¬ 
ity  of  its  occurrence.  There  are  different  ways  proposed  to 
compute  these  weights  in  the  sequential  Monte  Carlo  liter¬ 
ature  (Doucet  et  al.,  2001).  However,  the  main  condition  is 
that  they  have  to  sum  to  one,  in  other  words,  li  =  1- 
Here  we  simply  consider  =  M~l.  Note  that  if  we  par¬ 
ticle  fiq  at  B\  all  these  quantities  also  need  to  have  an  ij 
superscript.  Having  these  sample  paths,  we  can  compute  the 
transition  costs  and  probabilities  associated  with  invoking 
j±q  at  Bl.  For  the  collision  probability,  we  have 

M 

¥g(F\B‘,  vfi )  =  /i,J]  «  w(q)lF{x^rq)  (58) 

9=1 

fj*)  =  1  -  F8(F\B\^)  (59) 

where  IF  is  the  failure  indicator.  lF{x^rq)  is  one  if  there 
exists  a  time  step  k  <  T{q)  such  that  x^  e  F;  otherwise  it  is 
zero.  Further,  Tq,  or  more  rigorously  Tq(q) ,  is  the  stopping 
time  of  the  qth  particle  in  executing  \iq  at  Bl .  To  compute 
Tqiq) ,  we  only  need  to  check  the  condition  b  e  at  every 
time  step  and  find  the  first  time  step  where  belief  b  enters 
the  stopping  region  B' .  Thus,  we  can  compute  the  mean 
stopping  time  as 

M 

Tij  =  E[Tij]  «  wiq)Tii(q)  (60) 

<7=1 


Downloaded  from  ijr.sagepub.com  at  Texas  A&M  University  -  Medical  Sciences  Library  on  November  16,  2013 


22 


Thelntemational journal  ofRoboticsResearch  0(0) 


(a)  (b) 

Fig.  S  Sample  paths  induced  by  controllers  invoked  at  different  nodes,  (a)  For  M  =  100  particles,  sample  ground  truth  paths  and 
sample  estimation  mean  paths  are  shown  in  green  and  dark  red,  respectively,  (b)  The  most  likely  path  under  the  optimal  policy  and  the 
shortest  path  are  shown  in  red  and  yellow  respectively.  The  3a  M  L  estimation  uncertainty  tube  is  drawn  in  blue. 


To  compute  the  filtering  cost  defined  in  Section  5.5,  again 
we  use  the  particle- based  representation  of  belief, 


<j =  E 


t>i 


£tr(P*)|BV 

k= 1 


M  TV 


(61) 

q=  1  k= 1 


where  Pk]  is  the  estimation  covariance  at  the  kth  time  step 
of  the  qth  particle.  Finally,  the  cost  of  taking  g'>  at  B<  is  as 
follows: 


Cg(Bi,nii)  =  a  +a2?ii 


where  we  used  the  coefficients  ai  =  0.95  and  017  =  0.05. 
Table  1  shows  these  quantities  for  several  (B',gij)  pairs  in 
FIRM  corresponding  to  Figure 5. 

8.1.5.  Planning  and  replanning  on  FIRM.  Plugging  the 
computed  transition  costs  and  probabilities  into  Equation 
(32),  we  can  solve  the  DP  and  compute  the  graph  policy  jr9. 
This  process  is  performed  once  offline  if  the  goal  location 
is  fixed.  Figure  6(a)  shows  the  policy  7tg  on  the  constructed 
FIRM  in  this  example.  Indeed,  at  every  FIRM  node  B',  the 
policy  7T9  decides  which  local  controller  has  to  be  taken, 
which  in  turn  aims  to  take  the  robot  to  the  next  FIRM 
node.  Thus,  the  online  part  of  the  planning  is  significantly 
efficient  and  reduces  to  executing  the  controller  and  gener¬ 
ating  the  control  signal,  which  is  almost  an  instantaneous 
computation. 

Replanning:  An  important  consequence  of  this  frame¬ 
work  is  that  replanning  can  be  performed  using  FIRM  effi¬ 
ciently.  Suppose  due  to  some  unmodeled  large  disturbance, 
the  robot's  belief  deviates  significantly  from  the  planned 
path;  in  other  words,  for  some  appropriate  norm  ||  •  ||  on 
belief  space  we  have  life*  -  E[bj’]||  >  g,  where  bpk  is  the 
planned  belief  at  the  kth  time  step,  and  g  is  the  threshold 
for  deciding  if  replanning  is  needed  or  not.  In  such  cases, 


replanning  occurs  based  on  Algorithm  2.  In  Figure  6(b), 
we  illustrate  a  simple  replanning  process.  In  this  figure  it 
is  assumed  that  an  unmodeled  large  disturbance  affects  the 
system  such  that  the  estimation  mean  significantly  deviates 
from  the  planned  path.  The  deviated  mean  is  shown  on  the 
figure  as  the  'restart  point'.  Thus,  based  on  Algorithm  2, 
we  connect  this  point  to  PRM  .  In  Figure  6(b)  the  newly 
added  PRM  edges  (i.e.  £(0))  are  shown  by  dashed  green 
lines.  Then,  for  every  edge  in  £(  0),  we  design  a  local  con¬ 
troller.  Call  the  set  of  newly  constructed  local  controllers 
M(  0).  For  every  g  e  M(  0)  compute  corresponding  transi¬ 
tion  costs  and  probabilities.  Finally,  according  to  Bellman's 
principle  of  optimality,  we  use  the  precomputed  costs-to- 
go  J  9(-)  to  decide  which  controller  has  to  be  taken  at  the 
'restart  point'  using  Equation  (34).  Taking  this  controller, 
the  belief  enters  into  a  FI  RM  node,  and  from  there  again  we 
can  use  the  precomputed  ng  to  control  the  robot  toward  the 
goal  region. 

We  show  the  most  likely  path  under  jrg  in  red  in 
Figure  5(b).  The  shortest  path  is  also  illustrated  in  Fig¬ 
ure  5(b)  in  yellow.  It  can  be  seen  that  the 'most  likely  path 
under  the  best  policy'  detours  from  the  shortest  path  to  a 
path  along  which  the  filtering  uncertainty  is  smaller  and  it 
is  easier  for  the  controller  to  avoid  collisions. 


8.2.  Larger  environment 

In  this  section,  we  consider  the  same  omnidirectional  robot 
with  the  same  observation  model,  and  we  perform  planning 
in  a  larger  environment  (shown  in  Figure  7)  whose  size  is 
almost  10,000  m2.  Every  grid  square  is  a  10  x  10  area.  The 
standard  deviation  of  the  process  noise  is  assumed  to  be 
1  m  for  the  positional  degrees  of  freedom  and  7°  for  the 
angular  degree  of  freedom.  We  start  with  a  five-node  FIRM 
and  at  every  step  we  randomly  sample  five  more  nodes  until 
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Table  1.  Computed  costs  for  several  node-controller  pairs  in  FIRM  using  100  particles. 

( B\ii V)  pair 

S*,/x1,4 

bV4’8 

ba 

00 

0° 

o 

fllO./x10’11 

BV’3 

B3,ai3’6 

B<V6’12 

97% 

95% 

99% 

77% 

79% 

87% 

55% 

79% 

<j>y 

18.5967 

11.2393 

6.8229 

15.1148 

26.2942 

23.6183 

48.8189 

43.6207 

E  [T‘J] 

238.2 

193.0 

150.0 

209.6 

170.8 

200.3 

242.4 

219.2 

<y[Tin 

21.8 

28.7 
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Fig.  6.  Planning  and  replanning  on  FIRM,  (a)  Policy  ng  resulted  from  solving  the  DP  in  Equation  (31)  is  shown  by  red  arrows.  Indeed, 
for  every  FIRM  node,  the  policy  ng  tells  us  which  controller  has  to  be  taken,  (b)  In  this  figure  it  is  assumed  that  an  unmodeled  large 
disturbance  affects  the  system  such  that  the  estimation  mean  significantly  deviates  from  the  planned  path.  The  deviated  mean  is  denoted 
by  ‘restart  point’  in  the  figure. 


we  reach  500  nodes.  Thus,  overall,  we  construct  100  FIRM 
graphs  in  this  environment,  for  each  of  which  we  measure 
the  construction  time  (cumulative)  and  compute  the  suc¬ 
cess  probability.  Plots  in  Figure  8  show  these  quantities  as 
a  function  of  the  number  of  nodes  for  a  sample  run  on  an 
Intel  i5  dual-core  1.7  GHz  machine  with  4  GB  memory. 
Further,  50  particles  are  used  for  collision-checking,  and 
every  node  in  the  underlying  PRM  is  connected  to  its  three 
nearest  neighbors. 

Basically,  FIRM  construction  is  an  anytime  algorithm  in 
the  sense  that  one  can  increase  the  number  of  nodes  and 
stop  enlarging  the  graph  when  a  termination  condition  is 
satisfied  such  as:  (i)  achieving  a  desirable  success  probabil¬ 
ity  or  a  desirable  cost-to-go,  (ii)  no  change  being  observed 
in  the  success  probability  or  in  the  cost-to-go  for  a  signifi¬ 
cant  time,  or  (iii)  exceeding  the  maximum  time  allowed  for 
offline  computation. 

Again,  as  is  seen  in  Figure  7,  the  highest-likelihood  path 
under  the  optimal  policy  detours  from  the  shortest  path 
toward  the  more  informative  regions  in  the  environment. 
As  a  result,  it  reduces  the  collision  probability  and  at  the 
same  time  increases  the  estimation  accuracy  and  controller 
efficiency.  However,  it  is  important  to  note  that  the  returned 
solution  is  not  a  single  path,  but  it  is  a  feedback  law  over 
the  entire  space.  For  the  video  of  executing  this  plan  (with 
fewer  nodes  to  unclutter  the  video),  see  Extension  1. 


We  also  conducted  a  simulation  to  illustrate  the  robust¬ 
ness  of  the  method  to  large  deviations.  In  this  simulation, 
the  robot  is  pushed  away  from  the  roadmap  several  times 
by  some  large  disturbances,  and  replanning  is  performed 
online  based  on  Algorithm  2.  The  video  of  this  simulation 
is  also  available  (see  Extension  2). 

8.3.  Eight-arm  manipulator 

On  a  given  graph,  the  number  of  paths  between  two  given 
points  grows  exponentially  with  the  size  of  graph.  Thus, 
in  the  direct  propagation  of  uncertainty  on  a  roadmap,  the 
number  of  edge  costs  and  transition  probabilities  that  need 
to  be  computed  is  exponential  in  the  number  of  underly¬ 
ing  PRM  nodes  (see  Section  9  for  a  detailed  analysis).  As 
a  result,  when  we  deal  with  high-dimensional  state  spaces, 
where  PRM  needs  to  have  many  edges  and  nodes,  it  is  not 
feasible  to  use  the  methods  that  perform  direct  uncertainty 
propagation.  However,  using  FIRM,  we  only  need  to  com¬ 
pute  the  costs  and  transition  probabilities  for  as  many  edges 
as  the  underlying  PRM  has.  Thus,  we  can  easily  increase 
the  dimension  to  the  level  that  PRM  can  handle,  and  the 
complexity  of  the  algorithm  is  increased  only  by  a  con¬ 
stant  factor  (involving  computation  of  costs  and  transition 
probabilities  of  a  single  edge).  In  the  following  experi¬ 
ment,  we  verify  the  effectiveness  of  FIRM  in  handling 
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(a)  50  nodes. 


(b)  75  nodes. 


(c)  105  nodes. 


(d)  275  nodes. 


(e)  425  nodes. 


(f)  500  nodes. 


Fig.  7.  (a)-(f)  Different  snapshots  of  the  roadmap  for  50,  75, 105,  275,  425,  and  500  nodes,  respectively.  The  most  likely  paths  under 
the  optimal  plan  are  also  shown  in  blue.  Stars  show  the  landmarks.  The  means  and  covariances  of  the  FIRM  node  centers  are  shown  by 
small  blue  triangles  and  their  associated  red  ellipses,  respectively.  Also,  see  Extensions  1  and  2  regarding  the  video  of  planning  with 
FIRM  in  this  environment. 


Fig.  &  A  sample  run  showing  (a)  the  success  probability  of  the  generated  plan  versus  the  number  of  nodes,  as  well  as  (b)  the  construction 
time  (offline)  for  the  plan. 


high-dimensional  systems  through  a  simple  example  of  an 
eight-arm  manipulator.  To  the  best  of  our  knowledge,  this 
is  the  first  belief- space  planner  that  can  provide  a  plan  over 
an  entire  roadmap  for  an  eight-dimensional  system  while 
incorporating  expensive  costs  in  planning,  such  as  com¬ 
puting  collision  probabilities.  This  experiment  shows  that 
FIRM  can  be  used  as  a  practical  tool  in  many  real-world 
problems. 


8.3.1.  Motion  model.  We  consider  an  eight-arm  manipu¬ 
lator  with  eight  revolute  joints  in  the  plane.  The  state  of 
the  system  is  described  by  the  angles  of  joints  and  their 
velocities  x  =  { 6>i, . . . , ^8. ^i.  -  -  - .  ^s)T.  and  the  available 
control  is  considered  to  be  the  angular  acceleration  (or 
torque)  of  joints  u  =  (ai,a2, . . .  ,as).  The  process  noise 
w  =  ( wi,  w2, . . . ,  w8)  is  modeled  as  a  zero-mean  Gaussian 
noise  on  angular  accelerations.  Therefore,  the  continuous 
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motion  model  for  every  link  is  Q-,  =  a-,  +  w-„  whose  discrete 
version  for  the  entire  state  can  be  written  as 


xk+i=Axk+Buk  +  Gwk  (62) 


where 


A  = 


1 8  1 8<5f 
08  1 8 


,  B  = 


08 

leVst 
(63) 


where  St  is  the  time  interval  between  two  consecutive  time 
steps,  and  ln  and  0n  are  the  identity  matrix  and  square  zero 
matrix  of  dimension  n,  respectively. 


8.3.2.  Observation  model.  We  use  the  light-dark  environ¬ 
mentsetting  as  the  observation  model,  which  is  also  used  in 
Platt  etal.  (2010,  2011).  In  the  light-dark  environment,  the 
accuracy  of  sensory  readings  is  encoded  by  a  gray  level, 
in  which  the  regions  that  have  access  to  more  accurate 
sensory  readings  are  lighter  than  the  regions  that  do  not 
have  access  to  such  informative  sensory  readings.  In  this 
experiment,  we  assume  that  we  measure  the  state  of  the 
system,  but  this  measurement  is  more  accurate  as  we  get 
closer  to  the  left  wall  on  which  our  sensor  is  mounted.  (This 
model  is  adopted  from  Platt  et  al.  (2010).)  Thus,  we  have 
z  =  h(x)  =  [z1, . . . ,  z8]7,  where 

z'  =  0/  +  vi,  v'  ~  J\f( 0,  ( r]\x'  —  /|  +  at)2 )  (64) 

where  x'  is  thex  coordinate  of  the  /th  joint  location,  /  is  the 
I ocati on  of  the  verti cal  wall,  ^  defi nes  the  dependency  of  the 
noise  standard  deviation  on  the  distance  from  the  wall,  and 
ab  is  the  bias  standard  deviation.  Figure  9  shows  an  example 
of  such  an  environment,  in  which  /  =  -1.5,  rj  =  0.1,  and 
ob  =  10-4.  The  full  observation  model  can  be  written  as 

zk  =  h(xk)  =  Hxk  +  Mvk  (65) 

where  H  =[/8,08]andM  =/8. 


8.3.3.  Sampling  stabilizer  parameters.  The  described  sys¬ 
tem  is  a  controllable  and  observable  system,  and  thus  we 
adopt  the  5LQG  controller  as  the  stabilizing  controller. 
Therefore,  the  parameters  of  the  controller  are  points  in 
the  equilibrium  space,  as  explained  in  Section  5.  In  other 
words,  to  generate  sample  nodes  in  the  state  space,  we  need 
to  sample  the  configuration  space  (0\, . .  .,0s)  and  append 
zero  angular  velocities  to  it.  To  connect  these  samples  in 
the  state  space  we  design  simple  trajectories  between  nodes, 
along  which  we  accelerate  the  joints  (angles)  with  constant 
acceleration  until  they  are  halfway  to  the  next  node,  and 
after  which  we  decelerate  the  joints  until  they  reach  the  next 
node. 


8.3.4.  Construction  of  the  SLQG-FIRM  and  planning  with 
it.  First,  corresponding  to  sampled  nodes  in  the  state  space, 


we  compute  corresponding  FIRM  nodes  and  then  design 
local  controllers  according  to  Algorithm  1.  In  a  similar  pro¬ 
cedure  to  the  one  in  the  previous  experiment,  we  compute 
the  transition  costs  and  probabilities. 

To  solve  the  DP,  we  need  to  characterize  the  goal  nodes. 
In  Figure  9,  the  goal  region  for  the  tip  location  of  the 
manipulator  is  shown  by  a  purple  circle.  We  mark  all  PRM 
samples  whose  tip  locations  are  within  the  goal  region  as 
goal  nodes.  Setting  the  cost-to-go  to  zero  for  all  goal  nodes, 
we  solve  the  DP  and  compute  the  optimal  feedback  on  the 
graph  according  to  Algorithm  1.  Finally,  we  execute  the 
plan  based  on  Algorithm  2  and  we  illustrate  the  propagation 
of  the  covariance  of  the  manipulator  tip  in  Figure  9  in  red. 
As  can  be  seen  in  Figure  9,  there  are  two  passages  among 
the  obstacles  to  reach  the  goal  region.  Although  the  right 
passage  is  closer  to  the  initial  configuration  of  the  manip¬ 
ulator,  the  manipulator  detours  to  a  longer  path  through 
the  left  passage,  because  there  is  more  accurate  sensory 
information  available  in  the  left  passage  than  the  right  one. 
As  is  seen  in  this  example,  the  feedback  plan  minimizes 
the  collision  probability  and  picks  the  safest  path,  while 
being  robust  to  deviations.  In  other  words,  if  for  any  reason 
the  manipulator  deviates  significantly  from  the  underlying 
PRM ,  the  feedback  plan  connects  the  deviated  belief  to  the 
best  neighboring  FIRM  node  in  real  time,  and  continues  the 
pre-computed  plan  from  this  node. 


9.  Comparison  and  limitations 

In  this  section,  we  perform  a  short  comparison  of  SLQG- 
FIRM  against  the  two  most  related  methods  in  the  literature: 
BRM  (Prentice and  Roy,  2009)  and  LQG-M  P  on  roadmaps 
(Van  den  Berg  et  al.,  2011).  Both  methods  are  belief-space 
planners  that  exploit  roadmap-based  ideas.  We  compare 
the  methods  in  terms  of  the  offline  construction  and  online 
planning  complexity,  and  also  in  terms  of  some  other 
properties,  all  listed  in  Table  2.  In  the  following,  we  go  over 
the  complexity  analysis  that  leads  to  the  entries  in  this  table. 
Afterwards,  we  discuss  limitations  of  the  SLQG-FIRM  . 

Offline  construction  complexity:  In  a  general  graph,  the 
number  of  paths  between  two  given  nodes  is  exponential  in 
the  number  of  nodes  W .  For  example,  if  each  node  in  a  graph 
is  connected  to  k  nearest  neighbor  nodes  on  the  graph,  for 
a  search  depth  of  d  edges  on  the  graph,  the  corresponding 
search  tree  contai  ns  kd  paths.  Notice  that  each  of  these  paths 
has  d  edges  on  it.  Thus,  if  we  directly  (without  using  belief 
stabilizers)  propagate  the  uncertainty  on  a  roadmap  for  a 
depth  of  d,  we  have  to  evaluate  the  cost  on  dkd  edges.  So, 
the  asymptotic  complexity  of  the  overall  problem  is  of  the 
order  0(NkN).  Now,  if  computing  the  cost  and  transition 
probabilities  associated  with  each  edge  under  uncertainty  is 
a  constant  multiplier  0{  c)  of  computing  its  cost  in  a  deter¬ 
ministic  case,  the  overall  complexity  of  the  methods  based 
on  direct  belief  propagation  is  0(cNkN).  On  the  other  hand, 
i  n  any  vari ant  of  F I R  M ,  due  to  the  edge  i  ndependence,  only 
the  cost  of  O(Nk)  edges  needs  to  be  constructed  as  i  n  P  R  M , 


Downloaded  from  ijr.sagepub.com  at  Texas  A&M  University  -  Medical  Sciences  Library  on  November  16,  2013 


26 


The  International  Journal  of  Robotics  Research  0(0) 


Table  2.  Belief  space  roadmap-based  method  comparison  (without  using  a  heuristic  in  search  algorithms). 

Algorithm 

Offline 

construction 
complexity 
(no  heuristic) 

Replanning 

(online 

planning) 

complexity 

Future 

observations 

System 

requirement 

Valid  region 
of  plan 

Collision 

probabilities 

Generic  PRM 

0{Nk) 

0{k) 

Assumes  a  controller 
exists  to  drive  the 
system  from  node  to 
node 

On  the  graph 
only 

BRM 

0(cNkN) 

0(c^kN)  or 
0(cNkN ) 

Maximum 

likelihood 

observation 

Well  linearizable 
systems 

Vicinity  of  the 
nominal  path 

Not  considered 

LQG-MP  on 
roadmaps 

Generic  FIRM 

0(cNkN) 

O(cNk) 

0(cNkN) 

O(ck) 

All  observations 

Well  linearizable 
systems 

Assumes  a  controller 
exists  to  drive  the 
system  from  node  to 
node 

Vicinity  of  the 
nominal  path 

Union  of  con¬ 
vergence  regions  of 
local  controllers 

Simplified 

measures  are 

used 

SLQG-FIRM 

O(cNk) 

O(ck)  or  0(  1) 

All  observations 

Well  linearizable,  and 
linear  controllable  and 
observable  systems 

Vicinity  of  whole 
PRM  (entire  space 
for  a  dense  PRM) 

Computed 

and  thus  the  overall  complexity  of  offline  construction  of 
FIRM  is  0(  cNk). 

Online  planning  ( replanning )  complexity:  If  the  system 
deviates  from  the  valid  region  of  the  plan,  in  direct  prop¬ 
agation  methods,  edge  costs  need  to  be  recomputed  for  all 
edges.  So,  in  BRM  and  LQG-MP  on  roadmaps,  the  replan¬ 
ning  complexity  will  be  of  the  order  G(NkN).  If  the  cost 
of  each  edge  is  defined  in  such  a  way  that  it  only  depends 
on  the  belief  at  the  start  and  end  of  the  edge  (i.e.  does  not 
depend  on  the  belief  along  the  edge),  BRM  can  reduce  the 
computation  complexity  to  0(c(N //) kN)  through  covari¬ 
ance  factorization  techniques,  where  /  is  assumed  to  be  the 
length  (number  of  steps)  of  each  edge.  In  FIRM,  in  the 
case  of  replanning  (submitting  a  query  with  new  starting 
point),  it  is  only  necessary  to  connect  the  deviated  belief  to 
k  neighboring  FIRM  nodes.  Thus,  we  only  need  to  compute 
the  cost  for  the  k  new  edges.  It  is  worth  noting  that  if  the 
underlying  PRM  is  dense  enough  that  the  valid  region  of 
the  local  controllers  covers  the  space,  edge-cost  computa¬ 
tion  in  the  replanning  phase  reduces  to  zero,  because  if  the 
system  deviates  out  of  a  valid  region  of  a  local  planner,  it 
will  fall  into  the  valid  region  of  some  other  planner. 

To  reduce  the  complexity  of  the  search  algorithm  in  BRM 
and  LQG-MP  on  roadmaps,  it  is  assumed  that  the  costs 
on  different  edges  of  the  roadmap  are  independent.  This 
heuristic  can  reduce  the  complexity  of  the  algorithm,  but 
it  may  still  be  significantly  high  compared  to  the  PRM  or 
FIRM.  Moreover,  this  heuristic  (edge-independent  assump¬ 
tion)  is  not  true  without  having  belief  stabilizers,  and  thus 
search  algorithms  relying  on  such  a  heuristic  may  result  in 
solutions  arbitrarily  different  from  the  true  solution  of  the 
search  algorithm.  Assuming  that  no  such  heuristic  is  used 


in  the  search  algorithm,  Table  2  summarizes  the  complexity 
of  these  algorithms. 

The  huge  reduction  in  the  computational  complexity  of 
the  planning  algorithm  (in  particular,  in  the  online  phase) 
opens  many  possibilities  in  utilizing  POMDP  solvers  in 
real-world  applications.  Moreover,  due  to  its  sampling- 
based  nature,  it  ameliorates  the  curse  of  dimensionality  just 
as  PRM  does  in  the  deterministic  case.  In  other  words,  if 
the  dimension  of  the  system  increases,  we  need  a  greater 
number  of  nodes  N  in  the  underlying  PRM  to  capture  the 
free  space  connectivity,  in  which  case  we  cannot  use  direct 
methods  due  to  their  complexity.  However,  FIRM  can  tol¬ 
erate  the  increase  in  the  dimension  since  its  complexity  is 
only  a  constant  multiplier  of  the  PRM  complexity. 


9.1.  Limitations  of  the  SLQG-FIRM  and  future 
directions 

In  this  section,  we  discuss  limitations  of  the  proposed 
method.  It  is  important  to  distinguish  which  limitation  is 
associated  with  the  generic  FIRM  framework,  and  which 
limitation  is  associated  with  the  particular  presented  instan¬ 
tiation  of  the  FIRM,  that  is,  the  SLQG-FIRM.  In  some 
cases,  we  also  propose  ways  to  remedy  these  limitations  as 
future  research  directions. 

Stabilization  time:  The  FIRM  framework  introduces  the 
usage  of  belief  stabilizers.  However,  the  time  needed  for  the 
belief  stabilization  procedure  is  added  to  the  overall  execu¬ 
tion  time.  If  the  number  of  time  steps  along  the  nominal 
path  is  /,  and  the  number  of  time  steps  needed  for  stabiliza¬ 
tion  is  r,  the  extra  time  t  is  usually  negligible  compared 
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Fig.  9.  A  result  of  executing  the  FIRM  plan  for  an  eight-arm 
manipulator  in  a  light-dark  (sensing)  environment.  The  manipu¬ 
lator  is  attached  to  the  origin  ( 0, 0)  and  the  purple  region  is  the 
goal  region  for  the  manipulator  tip.  To  unclutter  the  figure,  we 
only  show  the  uncertainty  of  the  manipulator  tip  (end-effector). 
The  initial  mean  and  covariance  are  shown  in  black,  and  the  evo¬ 
lution  of  the  tip  covariance  during  the  plan  execution  is  shown  in 
red.  The  final  estimation  mean  and  the  true  configuration  of  the 
manipulator  are  shown  in  blue  and  green,  respectively.  Obstacles 
are  shown  in  brown.  The  manipulator  follows  a  longer  but  safer 
path  to  the  goal  region  through  the  left  passage,  compared  to  the 
shorter  but  risky  (with  high  collision  probability)  path  through  the 
right  passage. 

to  /.  However,  r  can  increase  as  the  sensing  uncertainty 
increases.  In  such  a  situation,  one  can  consider  two  cases:  if 
obstacles  are  close  to  the  robot,  it  is  indeed  unsafe  to  move 
with  a  poor  estimate,  and  it  is  indeed  better  to  lose  some 
time  to  gain  more  information,  and  then  start  moving.  On 
the  other  hand,  if  there  is  no  obstacle  close  to  the  robot, 
then  one  can  increase  the  size  of  the  corresponding  FIRM 
node,  and  thus  decrease  the  extra  stopping  time.  Moreover, 
efficient  sampling-based  methods,  which  are  aware  of  avail¬ 
able  information  at  different  locations  of  the  environment, 
and  thus  aware  of  the  mean  stabilization  time,  can  be  used 
to  efficiently  sample  the  nodes  in  the  locations  with  lower 
mean  stabilization  times.  These  issues  open  up  new  direc¬ 
tions  for  future  research.  However,  if  an  application  is  very 
sensitive  to  the  extra  time,  FIRM  may  not  be  a  good  choice 
for  it,  and  methods  such  as  BRM  or  LQG-MP  can  result  in 
better  guarantees  on  execution  time. 

Controllability  and  observability:  As  mentioned  in  Sec¬ 
tion  5,  SLQG-FIRM  works  for  systems  that  satisfy  Property 
1 ,  which  basically  requires  the  linearized  system  about  the 
PRM  nodes  to  be  controllable  and  observable.  Although 
this  includes  a  large  class  of  systems,  it  excludes  some 


important  systems,  such  as  non-holonomic  systems  that 
are  not  linearly  controllable  about  any  point.  It  is  worth 
noting  that  this  is  not  a  limitation  of  the  generic  FIRM 
framework,  but  a  limitation  of  the  SLQG-FIRM.  More 
recent  instantiations  of  FIRM,  such  as  PLQG-based  FIRM 
(Agha-mohammadi  et  al.,  2012c)  or  DFL-based  FIRM 
(Agha-mohammadi  et  al.,  2012a),  aim  to  relax  the  control¬ 
lability  requirements  in  Property  1  and  thus  can  include 
non-holonomic  systems  as  well.  However,  relaxing  the 
observability  assumption  is  still  an  open  problem. 

Gaussian  beliefs:  The  reachability  argument  in  the 
SLQG-FIRM  is  restricted  to  Gaussian  beliefs.  In  other 
words,  we  cannot  guarantee  reachability  to  some  pre¬ 
defined  non-Gaussian  beliefs  with  SLQG  controllers.  This 
issue  is  a  subject  of  future  research. 

Increasing  the  uncertainty:  Although  it  may  rarely  hap¬ 
pen  in  practice,  it  is  possible  to  have  a  situation  that  leads 
to  an  uncertainty  growth  during  the  belief- stabilization  pro¬ 
cess.  However,  this  issue  can  be  addressed  easily.  Notice 
that  FIRM  nodes  are  known  a  priori.  Thus,  at  the  begin¬ 
ning  of  each  stabilization  procedure,  we  can  compare  the 
current  belief  with  the  stationary  belief  of  the  stabilizer.  If 
the  current  belief  has  more  information  than  the  stationary 
belief  (e.g.  if  ah  eigenvalues  of  the  estimation  covariance 
are  strictly  less  than  the  corresponding  eigenvalues  of  the 
stationary  estimation  covariance),  we  replan  from  the  cur¬ 
rent  belief  based  on  Algorithm  2.  Therefore,  uncertainty 
will  not  be  increased  during  the  stabilization  procedure. 

Locally  linearizable  systems:  If  a  linear  representation  of 
the  system  of  interest  cannot  be  obtained  (e.g.  if  the  system 
state  lives  in  a  discrete  set  of  states),  the  class  of  methods 
that  use  the  linearized  system  as  a  local  approximation  of 
the  true  system  will  not  work.  In  this  case,  another  class 
of  methods  can  be  adopted  which  can  handle  these  systems 
much  better,  such  as  those  in  Smith  and  Simmons  (2005), 
Kurniawati  et  al.  (2008),  and  Kurniawati  et  al.  (2011).  Com¬ 
ing  up  with  belief  stabilizers  that  work  in  discrete  state 
space  settings  to  design  a  discrete-state  variant  of  FIRM  is 
also  an  area  for  future  research. 

Velocity  reduction  in  dynamical  systems:  To  apply 
SLQG-FIRM  to  dynamical  systems,  the  underlying  PRM 
samples  need  to  be  selected  from  the  equilibrium  space,  in 
other  words,  they  need  to  have  zero  velocity.  As  a  result  a 
reduction  in  the  system’s  velocity  is  expected  while  trying 
to  reach  the  FIRM  nodes.  However,  in  many  applications, 
reducing  the  speed  at  nodes  to  gain  the  robustness,  reli¬ 
ability,  and  scalability  offered  by  FIRM  may  be  a  useful 
trade-off.  Nevertheless,  this  reduction  in  speed  may  not  be 
desirable  for  some  applications  where  the  system  cannot  (or 
should  not)  decrease  its  velocity.  For  such  systems,  Agha- 
mohammadi  et  al.  (2013a)  propose  a  FIRM  variant  based 
on  periodic  controllers  which  does  not  require  a  reduction 
in  the  system’s  velocity.  However,  designing  more  efficient 
variants  of  FIRM  that  can  sample  points  with  non-zero 
velocities  without  introducing  periodicity  in  the  system’s 
motion  is  an  interesting  future  research  direction. 
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10.  Conclusion 

In  this  paper,  we  have  proposed  the  FIRM  framework  for 
solving  the  motion-planning  problem  under  motion  and 
sensing  uncertainties.  This  problem  is  originally  a  POM  DP, 
whose  solution  is  computationally  intractable.  Exploiting 
feedback  controllers,  we  reduced  it  to  a  tractable  FIRM 
M  DP  that  can  be  solved  using  standard  DP  techniques. 
FIRM  utilizes  feedback  controllers  to  create  reachable  node 
regions  in  belief  space.  An  important  consequence  is  that 
FIRM  preserves  the  optimal  substructure  property  on  the 
roadmap  and  thus  overcomes  the  curse  of  history  in  the 
original  POM  DP  problem.  Finally,  by  computing  the  col¬ 
lision  probabilities,  obstacles  are  also  appropriately  taken 
into  account  in  planning  on  FIRM .  We  showed  an  instan¬ 
tiation  of  the  abstract  FIRM  framework  using  SLQG  con¬ 
trollers  and  illustrated  the  construction  and  planning  results 
on  it.  By  extending  the  probabilistic  completeness  con¬ 
cept  to  planners  under  uncertainty,  we  also  showed  that 
FIRM  is  probabilistically  complete  under  uncertainty.  We 
believe  that  FIRM  provides  an  important  step  toward  solv¬ 
ing  POM  DPs  and  utilizing  them  as  a  practical  tool  for  robot 
motion  planning  under  uncertainty. 
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Appendices 

A.  Index  to  Multimedia  Extensions 

The  multimedia  extension  page  is  found  at 
http  ://www.ijrr.org 

Table  of  Multimedia  Extensions 


Extension  Type 

Description 

1 

Video 

Executing  the  FIRM  plan  in  the  environment 
shown  in  Figure  7 

2 

Video 

Real-time  replanning  with  FIRM,  which  shows 
the  robustness  of  the  method  to  large 
disturbances 

B.  Time-varying  LQG  controller 

The  time-varying  LQG  controller  is  often  used  to  track  a 
pre-planned  trajectory  (also  called  a  nominal,  desired,  or 
open-loop  trajectory)  in  the  presence  of  process  and  obser¬ 
vation  noise.  In  principal  it  is  designed  (and  optimal)  for 
linear  systems  with  Gaussian  noise,  but  it  can  also  be  uti¬ 
lized  for  stabilizing  nonlinear  systems  locally  around  the 
planned  trajectory.  An  LQG  controller  is  composed  of  a  KF 
as  an  estimator  and  an  LQR  as  a  controller.  At  every  time 
step  k ,  the  KF  provides  the  a  posteriori  distribution  (belief) 
bk  over  the  system  state,  and  the  LQR  generates  control  u k 
based  on  bk. 

In  this  appendix,  we  first  discuss  the  system  lineariza¬ 
tion  and  planned  nominal  trajectory,  and  then  discuss  the 
KF,  LQR,  and  LQG  corresponding  to  this  nominal  trajec¬ 
tory.  Consider  the  nonlinear  partially  observable  state-space 
equations  of  the  system  as  follows: 

Xk+\  =f(xk,uk,wk),  wk  ~  M(  0,  Qk)  (66a) 

Zk  =  h(xk,  Vi) ,  Vi  ~  U( 0, Rk)  (66b) 

A  planned  nominal  trajectory  for  this  system  is  a 
sequence  of  planned  states  (^k)k> o  and  planned  controls 
( u%)k> o  such  that  it  is  consistent  with  the  noiseless  dynam¬ 
ics  model;  in  other  words,  we  have 

=f(4,4,  o)  (67) 

The  planned  trajectory  can  be  a  finite  sequence  of  some 
length  N.  The  role  of  a  closed-loop  stochastic  controller, 
during  the  trajectory  tracking,  is  to  compensate  for  the 
robot’s  deviations  from  the  planned  trajectory  and  to  keep 
the  robot  close  to  the  planned  trajectory  in  the  sense  of 
minimizing  the  following  quadratic  cost: 


E 


VUj  -  -x£)T  Wx(xk  -  xfy  +(  Mi  -  m£)T  Wu{  Mi  -  lfk) 

(68) 


,k>  0 


where  Wx  and  Wu  are  positive-definite  weight  matrices  for 
the  state  and  control  costs,  respectively. 

Since  the  state  space  is  not  fully  observable  and  is  only 
partially  observable,  we  do  not  have  access  to  the  perfect 
value  of  the  state  xk,  and  thus,  we  provide  the  estimate  xk 
of  the  state  xk  based  on  the  available  observations  zo±  from 
the  beginning  up  to  the  current  time  step.  Then,  based  on  the 
separation  principle  (Kumar  and  Varaiya,  1986;  Bertsekas, 
2007),  it  can  be  shown  that  in  a  linear  system  with  Gaus¬ 
sian  noise,  the  above  minimization  in  terms  of  the  error 
xk  —  x%  is  equivalent  to  performing  two  separate  minimiza¬ 
tions  based  on  the  estimation  error  xk  —x£  and  the  controller 
error  —  x£,  whose  summation  is  the  same  as  the  orig¬ 
inal  main  error  xk  —  x?k  =  (xk  —  x^)+(*£  —  x£),  where 
=  E[x^  ]  =  E[xfc|zo:A:]-  As  a  major  consequence,  the 
design  of  the  stochastic  controller  with  a  partially  observ¬ 
able  state  space  (LQG)  reduces  to  designing  a  controller 
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with  a  fully  observable  state  (LQR)  and  designing  an  esti¬ 
mator  (KF),  separately.  In  the  following,  we  first  discuss  the 
linearization  of  a  nonlinear  model.  Then  we  discuss  how  a 
KF  and  an  LQR  can  be  designed  for  this  linearized  system. 
Finally,  we  combine  them  to  construct  a  time- varying  LQG 
controller. 

Model  linearization'.  Given  a  nominal  trajectory 
(^k,i^k)k> o>  we  linearize  the  dynamics  and  observation 
model  in  Equation  (66)  as  follows: 

Xk+ 1  =  f(xpk,upk,0)+Ak(xk-xpk)+Bk(uk-upk) 

+  Gkwk,  wk  ~  V(  0,  Qk )  (69a) 

zk  -  h(xpk,0)+Hk(xk  -xpk)+Mkvk,  vk  ~Af(0,Rk) 

(69b) 

where 

o). 

G,  =  ),H,  =  fK,0), 

3  w  dx 

dh  rj 

Mk  =  —(4, 0)  (70) 

3v 

Now,  let  us  define  the  following  errors: 

•  LQG  error  (main  error):  ek  =  xk  —  x^ 

•  KF  error  (estimation  error):  ek  =  xk  —  xjJ" 

•  LQR  error  (estimation  of  LQG  error):  ej  = 

Note  that  these  errors  are  linearly  dependent:  —  'K  + 

e*.  Also,  defining  8uk  =  uk  —  i/k  and  8zk  =  zk  —  z?k  := 
zk-h(^k,  0),  we  can  rewrite  the  above  linearized  models  as 
follows: 


2k+ 1  =  ^kek  +  +  GkWk  (71a) 

=  Hkek+Mkvk  (71b) 


In  the  update  step,  the  mean  and  covariance  of  posterior 
are  computed.  For  the  system  in  Equation  (71),  the  update 


step  is 

Kk  =  P;Hj(Hkp-Hj  +  MkRkMTk)~x  (77) 

^ £-i_i  =  ^+1  T  ^jfe+i  ( $Zk+i  Hk+  i^_(_i)  (78) 

=  (7  —  Kk+\Hk+i)  Pk+l  (79) 

Note  that 

-  Efelzotit]  =  4  + ^  =  A  +  E[ei|zo:it]  (80) 

Pk  =  C[jct|zb:t]  =  p£  =  C[ek\zo:k]  (81) 


LQR  controller.  Once  we  obtain  the  belief  from  the  filter, 
a  controller  can  generate  an  optimal  control  signal  accord¬ 
ingly.  In  other  words,  we  have  a  time- varying  mapping  \ik 
from  the  belief  space  into  the  control  space  that  generates 
an  optimal  control  based  on  the  given  belief  uk  —  fik{  bk)  at 
every  time  step  k.  The  LQR  controller  is  of  this  kind  and  it 
is  optimal  in  the  sense  of  minimizing  the  following  cost: 


Jlqr  = 


E 


yy  x+  -  <)t  wx(x+  ~  4)  +( uk  -  i/k)T  wu(  Uk  -  i/k) 


Lk>0 

=  E 


J](  e+)T  Wx( e+)  +  ( 8uk )T  Wu(  8uk) 


Lk>0 


(82) 


The  linear  control  law  that  minimizes  this  cost  function  for 
a  linear  system  is  of  the  form 

8uk  =  -Lke+  (83) 


where  the  time-varying  feedback  gains  Lk  can  be  computed 
recursively  as  follows: 


KF\  In  Kalman  filtering,  we  aim  to  provide  an  estimate  of 
the  system’s  state  based  on  the  available  partial  information 
we  have  obtained  up  to  time  k,  that  is,  zo±-  The  state  esti¬ 
mate  is  a  random  vector  denoted  by  xk,  whose  distribution 
is  the  conditional  distribution  of  the  state  on  the  obtained 
observations  so  far,  which  is  called  belief  and  is  denoted 
by  bk. 


bk  =  P(xt)  =  p(xk\Z0:k) 

(72) 

xl  =E[xk\zo-.k] 

(73) 

Pk  =  C[xk\Z0:k\ 

(74) 

where  E[- 1  *]  and  C[-|-]  are  the  conditional  expectation  and 
conditional  covariance  operators,  respectively.  In  the  Gaus¬ 
sian  case,  we  have  bk  =  Af(x£ ,Pk);  in  other  words,  the 
belief  can  be  characterized  only  by  its  mean  and  covariance, 
that  is,  bk  =  (x^,Pk). 

Kalman  filtering  consists  of  two  steps  at  every  time  stage: 
a  prediction  step  and  an  update  step.  In  the  prediction  step, 
the  mean  and  covariance  of  prior  x^  are  computed.  For  the 
system  in  Equation  (71),  the  prediction  step  is 


Lk  =  (BjSk+1Bk  +  Wu)~x  BrkSk+lAk  (84) 
Sk  =  Wx+  AJkSk+lAk  -  AJkSk+lBkLk  (85) 

If  the  nominal  path  is  of  length  N,  then  SN  =  Wx 
is  the  initial  condition  of  the  above  recursion,  which  is 
solved  backwards  in  time.  Note  that  the  full  control  is 
uk  =  Wfc  T-  8uk. 

LQG  controller.  Plugging  the  obtained  LQR  control  law 
into  the  Kalman  filtering  equations,  we  obtain  the  following 
error  dynamics  for  the  defined  errors: 


( &k+\  \  _  ( Ak  RkLk  PkLk  \  /  &k  \ 

\£fc+ 1  /V  0  Ak  —  Kk+\Hk+\Ak  )  \ek  ) 

+  [  Gk  0  \  /  wk  \ 

\Gk  —  Kk+\Hk+iGk  —Kk+\Mk+i  J  yvfc+i  J 

(86) 


or  equivalently, 

(  ek+\  \  (  Ak  ~BkLk  \  f  ek\ 

V^+i  J  V  Kk+lHk+1AkAk  -  BkLk  -  Kk+lHk+lAk)  \  e+ ) 


^k+ 1  — Ak&k  Bk8uk  (75) 

Pk+l=AkP+Aj  +  GkQkGl  (76) 


+ 


/  Gk 
\Kk+lHk+lGk 
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Defining  ft  :=  (e^)1  and  qk  :=  we  can 

rewrite  Equation  (87)  in  a  more  compact  form  as 


ft+l  =  F  ktk  +  GkQk,  Qk 


Ok 


Ok  o  \ 

0  Rk+i  ) 


mo  ,0*), 


(88) 


with  appropriate  definitions  for  F~k  and  Gk. 

The  above  equation,  along  with  the  equation  on  estima¬ 
tion  covariance  propagation, 

Pk+i  =  d  ~ Kk+iHk+i){AkPkAl+GkQkGl )  (89) 


characterize  the  evolution  of  state  xk  and  belief  bk  = 
(x£,Pk)  under  the  time- varying  LQG  controller. 


C.  SLQG  controller 


Again,  similar  to  the  time-varying  case,  since  we  only 
have  imperfect  information  about  the  state  xk,  we  have  to 
make  the  estimate  about  the  state  based  on  the  avail¬ 
able  observations  z0*.  Accordingly,  the  controller  gener¬ 
ates  the  control  signal  based  on  the  estimated  value  of  the 
state;  i.e.,  belief.  Based  on  the  separation  principle  (Bert- 
sekas,  2007),  in  a  linear  system  with  Gaussian  noise,  min¬ 
imization  of  the  cost  in  Equation  (92)  is  equivalent  to  per¬ 
forming  two  separate  minimizations  that  lead  to  the  sep¬ 
arate  design  of  the  SKF  and  SLQR.  In  the  following,  we 
first  discuss  the  linearization  of  a  nonlinear  model,  and  then 
we  discuss  how  the  SKF  and  the  SLQR  can  be  designed 
for  this  linearized  system,  and  finally,  we  combine  them  to 
construct  an  SLQG  controller. 

Model  linearization:  Given  a  desired  point  xp,  we  lin¬ 
earize  the  dynamics  and  observation  model  in  Equation  (90) 
as  follows: 


The  SLQG  controller  is  often  used  to  regulate  (or  stabilize) 
the  system  state  to  a  pre-planned  point  (also  called  the  set- 
point,  nominal,  or  desired  point)  in  the  presence  of  process 
and  observation  noise.  In  principal  it  is  designed  (and  opti¬ 
mal)  for  linear  systems  with  Gaussian  noise,  but  it  can  also 
be  utilized  for  stabilizing  nonlinear  systems  locally  around 
the  planned  point.  The  SLQG  controller  is  composed  of  a 
stationary  Kalman  filter  (SKF)  as  an  estimator  and  a  sta¬ 
tionary  linear  quadratic  regulator  (SLQR)  as  a  controller. 
At  every  time  step  k,  the  SK  F  provides  the  a  posteriori  dis¬ 
tribution  (belief)  bk  over  the  system  state,  and  the  SLQR 
generates  control  uk  based  on  bk. 

In  this  appendix,  we  first  discuss  the  system  linearization 
around  the  planned  point,  and  then  discuss  the  SK  F,  SLQR, 
and  SLQG  corresponding  to  this  nominal  point.  Consider 
the  nonlinear  partially  observable  state-space  equations  of 
the  system  as  follows: 

Xk+i  =  f(Xk,uklWk),  wk~M0,Qk)  (90a) 
zk  =  h(xklVk),  vk  ~ U(0,Rk)  (90b) 

and  consider  a  planned  state  point  xp,  to  whose  vicinity 
the  controller  has  to  drive  the  system.  If  the  system  state 
reaches  the  point  xp,  it  is  assumed  that  the  system  remains 
there  with  zero  control,  up  =  0,  in  other  words, 


Xfc+i  =  f(xp,0,0)+As(xk  -xp)+Bs(uk  -0) 

+GsW(c,  wk~m0,Qs)  (93a) 

zk  =  h(xp,0)+Hs(xk-xp)+Msvk,  vk  ~m0,Rs) 

(93b) 

where 

As  =  ^(xp, 0,0),  Bs  =  ^(xp, 0,0),  Gs  =  ^—(xp, 0, 0) 
9X  dll  dW 

,,  Oh,  n  ^  „  dh .  „ 

Hs  = —  xp,0  ,  Ms  =  —  xp,0  94 

dx  dv 

Now,  let  us  define  the  foil  owing  errors: 

•  SLQG  error  (main  error):  ek  =  xk  -xp 

•  SKF  error  (estimation  error):  ek  =  xk  -7£,  whereT^  = 
E[xjf]  =E[Xfc|Z0*] 

•  SLQR  error  (estimation  of  SLQG  error): =K£ -xp 

N  ote  that  these  errors  are  I  i  nearly  dependent:  ek  =  +  ek . 

Defining  Suk  :=  uk  and  Szk  :=  zk-zp  =  zk  -  h(xp,  0),  we 
can  rewrite  the  above  linearized  models  as  follows: 

e*:+i  =  Asek  +BsSuk  +  Gswk  (95a) 

Szk  =  H  sek  +  M  svk  (95b) 


xp  =  f(xp,  0,0)  (91) 

The  role  of  a  closed-loop  stochastic  controller  during  the 
state  regulation  is  to  compensate  for  robot  deviations  from 
the  desired  point  due  to  noise  effects,  and  to  drive  the  robot 
close  to  the  desired  point  in  the  sense  of  minimizing  the 
following  quadratic  cost: 


i 


=  E 


^(X(c  -XP)T  Wx(Xk  -Xp)+(Uk)T  Wu(Uk) 
_k>  0 


(92) 


where  Wx  and  Wu  are  positive-definite  weight  matrices  for 
the  state  and  control  costs,  respectively. 


SKF:  In  SKF,  we  aim  to  provide  an  estimate  of  the  sys¬ 
tem's  state  based  on  the  available  partial  information  we 
have  obtained  up  to  time  k,  that  is,  z0±.  The  state  esti¬ 
mate  is  a  random  vector  denoted  by  xjf,  whose  distribution 
is  the  conditional  distribution  of  the  state  on  the  obtained 
observations  so  far,  which  is  called  belief  and  is  denoted  by 
bk  =  p(xjf)=  p(xk\zo±).  In  the  Gaussian  case,  the  belief 
can  only  be  characterized  by  its  mean  and  covariance,  that 
is,  bk  =  Itf.Pk).  Thus,  in  the  Gaussian  case,  we  can  write 


bk  =  p(x£)  =  p(xk\z0:k)  =  ,P k)  o  bk  =  (Z£,Pk) 

(96) 

X  =  E[x*|z0*],  Pk  =  C[Xfc|Z0*] 


(97) 
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where  E[-|-]  and  C[-|  ]  are  the  conditional  expectation  and 
conditional  covariance  operators,  respectively. 

SKF  consists  of  two  steps  at  every  time  stage:  a  predic¬ 
tion  step  and  an  update  step.  In  the  prediction  step,  the  mean 
and  covariance  of  prior  are  computed.  For  the  system  in 
Equation  (95)  the  prediction  step  is 


®k+i  —  ~t~BsSu k  (98) 

PM=AsPtA]  +  GsQsG]  (99) 

I  n  the  update  step,  the  mean  and  covariance  of  posterior 
are  computed.  For  the  error  system  in  Equation  (95),  the 
update  step  is 

Kk=P;HJs(HsP;HJs  (100) 

6/J+l  =  %+l  +  K(;+l(  SZk+l  —  Hs^+1)  (101) 

P++1=(/  -/Cfc+iHs)P,-+1  (102) 

N  ote  that 


where  the  stationary  feedback  gain  Ls  can  be  computed  as 
follows: 

Ls  =  (B]S5Bs  +  Wu)-1B]SsAs  (109) 

ss  =  i/i/x+aJssas-a;[ssbsls  (no) 

where  the  second  equation  is  indeed  a  DARE  that  can  be 
efficiently  solved  for  Ss.  Plugging  Ss  into  Equation  (109), 
we  get  the  feedback  gain  Ls. 

SLQG  controller :  Plugging  the  obtained  control  law  of 
SLQR  into  the  SKF  equations,  we  can  get  the  following 
stationary  dynamics  for  the  defined  errors: 

(ek+1\_(As-BsLs  BSLS  \{ek\ 
\ek+1)-\  0  As-KsHsAsJXSk) 

+  (gs-Gkshsgs  (111) 

or  equivalently, 


x+  =  xP+-e+,  Pk  =  P  +  (103) 

In  SKF,  if  ( AS,HS )  is  an  observable  pair  and  (AS,QS)  is  a 
controllable  pair,  where  GsQsG]  =  OsQ],  then  the  prior 
and  posterior  covariances  P^r  and  Pk  and  the  filter  gain  Kk 
all  converge  to  their  stationary  values,  denoted  by  P~,  Ps, 
and  Ks,  respectively  (Bertsekas,  2007).  P~  can  be  computed 
by  solving  the  following  DARE  in  Equation  (104).  H avi ng 
P~,  the  stationary  gain  Ks  and  estimation  covariance  Ps  are 
computed  as  follows: 

p;  =  gsqsg  J 

+  As(P--P-Hj(HsP~Hj  +MSRSMJ)-1HSP-)AJ 

(104) 

KS  =  P-HJ(HSP~HJ  A-MsRsM])-1  (105) 

Ps  =  U  ~KSHS)P;  (106) 


/ek+i\_/  As  -BSLS  \(ek\ 

\K+i)  ~  \KsHsAs  As-BsLs-KsHsAs){et) 

+  (kshssgs  KSM s )  ( v^+! )  (112) 

Defining  &  :=  (e^)7  and  qk  :=  (wfc,V(c+i)T,  we  can 
rewrite  Equation  (112)  in  a  more  compact  form  as 

ft+i  =  Fs<tk  +  Gsqk,  qk  ~  Af(  0,QS) ,  Qs  =  R°J 

(113) 

with  appropriate  definitions  forF~s  and  (js- 
It  can  be  shown  that  if  F~s  is  a  stable  matrix  (i.e. 
linWoof F~S)K  =  0),  &  converges  i.d.  to  ;s  ~  Af( 0,VS). 
Stationary  covariance  Vs  is  the  solution  of  the  following 
Lyapunov  equation: 


SLQR  controller:  In  the  SLQR  we  have  a  stationary  map¬ 
ping  \±s  from  the  belief  space  to  the  control  space  that 
generates  an  optimal  control  based  on  the  given  belief  u k  = 
/is(  bk)  at  every  time  step  k.  The  SLQR  controller  is  optimal 
in  the  sense  of  minimizing  the  following  cost: 


J  SLQR  =  E 


£(*?  -  Xp)7  WA*t  ~  Xp)  +( uk)T  Wu(  uk) 

_k>  0 


=  E 


J](^)T  l/l/x(^+)+(Su*)T  Wu(Suk) 

_k>  0 


(107) 


If  (ASlBs)  is  a  controllable  pair  and  (As,l/I/X)  is  an  observ¬ 
able  pair,  where  l/l/xTl/l/x  =  Wx,  then  the  stationary  linear 
control  law  that  minimizes  the  cost  function  ] Slqr  for  a 
linear  system  is  of  the  form 

Suk  =  -Ls gf  (108) 


Vs  =  FsVsF]  +GsQsGJs  (114) 
Note  that  "Ps  can  be  decomposed  into  four  blocks, 


Vs  = 


Vs, 12  \ 
Vs, 22  ) 


(115) 


such  that  Vs, 11  =  lirrWooCte*]  anc|  ps22  = 

lim^ooC^].  Therefore,  sincet^  =  xp  +'§^,  the  estima¬ 
tion  mean  also  converges  to  a  stationary  random  variable, 
denoted  by  x+: 


:=  limlft  ~  Af(xp,Vs,22)  (116) 

k — >-00 

Due  to  the  linear  relation  ek  =  +  ek,  we  can  also 

conclude  lim^oo  C[2fc]  =  Vs,n  +  Vs,22  -  2Vs,n ■  It  can  be 
proven  that  in  SLQG,  the  stability  of  matrix  F~s  is  a  direct 
consequence  of  the  controllability  of  pair  ( AS,BS )  and  the 
observability  of  pair  (AS,HS)  (Bertsekas,  1976,  2007). 
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Thus,  collecting  all  the  conditions,  if  and 

( As,Qs )  are  controllable  pairs,  where  GsQsGj  =  QSQ], 
and  if  ( AS,HS )  and  (AS,WX)  are  observable  pairs,  where 
Wx  =  WjWX9  then  the  belief  bk  converges  in  distribution 
to  a  stationary  belief  under  the  SLQG: 

bs  :=  lim  bk=N(x+,K)  (117) 

k — >  oo 


where  Pf  is  a  deterministic  quantity  and  we  can  character¬ 
ize  the  distribution  over  the  stationary  belief  as 


xP 

Pt 


(118) 


estimation  mean  at  v,  the  probability  of  hitting  the  stopping 
region  is  maximized  and  the  stopping  time  is  minimized. 

Combining  the  results  for  estimation  covariance  and  esti¬ 
mation  mean,  if  we  define  the  region  B  as  a  set  in  the 
Gaussian  belief  space  with  a  non-empty  interior  centered 
at  ( v,  Py),  the  belief  bk  =  Pk)  enters  region  B  in  finite 
time  with  probability  one.  Thus,  the  pair  ( B ,  /z)  is  a  proper 
pair  over  GB;  in  other  words,  B  is  reachable  under  /z  starting 
from  any  Gaussian  distribution.  □ 

E.  Proof  of  Lemma  4 

Before  proving  Lemma  4,  we  state  and  prove  the  following 
lemma. 


D.  Proof  of  Lemma  3 

Proof  Let  us  consider  the  state-space  model  of  the  linear 
system  of  interest  as  follows: 

xk+i=  Axk  +  B uk  +  G wk9  wk  ~  J\T(  0,  Q)  (1 19a) 

zk=  Uxk  +  vjk,  vk  ~  A f(  0,  R)  (119b) 

Based  on  Lemma  1,  if  (A, B)  and  (A,  Q)  are  controllable 
pairs,  where  GQGT  =  QQT,  and  if  ( A,  H)  and  ( A,  Wx)  are 
observable  pairs,  where  Wx  =  WjWx,  then  the  estimation 
covariance  deterministically  tends  to  a  stationary  covari¬ 
ance  Ps.  Therefore,  for  any  €  >  0,  after  a  deterministic 
finite  time,  Pk  enters  the  e -neighborhood  of  the  stationary 
covariance,  denoted  by  Ps. 

The  estimation  mean  dynamics,  however,  are  stochastic 
and  are  as  follows  for  the  system  in  Equation  (119): 

x++1  =  v+(  A  -  BL  -  K*+1HA)(*+  -  v) 

+  K£+iHA(x£  —  v)  +K£+iHGw£  +  K^+iv^+i 
=  v— ( A  —  BL)  v+(  A  -  BL  -  Kk+lHA)x+ 

+  Kk+iHAxk  +  K,+iHGw,  +  Kk+\vk+i  (120) 

where  the  Kalman  gain  Kk  is 

K,  =  P^Hr(Hp-Hr  +  R)1  (121) 

Since  K  is  full  rank  (due  to  the  condition  on  the  rank  of  H), 
and  since  v  and  w  represent  Gaussian  noise,  Equation  (120) 
induces  an  irreducible  Markov  process  over  the  state  space 
(Meyn  and  Tweedie,  2009).  Thus,  if  we  have  a  stopping 
region  for  the  estimation  mean  with  size  €  >  0,  the  esti¬ 
mation  mean  process  will  hit  this  stopping  region  in  finite 
time  (Meyn  and  Tweedie,  2009),  with  probability  one. 

Based  on  the  estimation  mean  dynamics  in  Equation 
(120)  and  the  state  dynamics  in  Appendix  C,  in  the  absence 
of  a  stopping  region,  if  the  estimation  mean  process  and 
state  process  start  from  vj  and  v0  respectively,  such  that 
E[xJ]  =  v  and  E[xq]  =  v  (which  indeed  is  the  case  in 
FIRM  due  to  the  usage  of  edge  controllers),  ‘the  mean  of 
estimation  mean’  remains  on  v.  That  is,  ]  =  v,  for 
all  k.  As  a  result,  if  we  center  the  stopping  region  for  the 


Lemma  5.  Consider  the  bounded  function  0  <f(X)<  1, 
and  kernel  k(  X\  X)  >  0.  Then,  for  any  set  A,  we  have 

||  J  k(X',X)f(X')dX' ||  <  ||  J  k(X',X)dX'\\  (122) 

A  A 

Proof  Given  the  properties  of  /(•)  and  k(  •,  •)>  we  have 
k(  X',  X)f(  X')  <  k(X',Xf  for  all  X  and  AA  Taking 
the  integral  from  both  sides  with  respect  to  X'  and  then 
taking  the  supremum  norm  with  respect  to  A\  the  result 
follows.  □ 

Now  we  prove  Lemma  4. 

Proof  If  we  denote  the  domain  of  operator  by  79,  we 
know  that  for  all  /  e  V  we  have  0  <  /( X)  <  1,  because 
/( X)  is  the  probability  of  reaching  given  set  S  under  some 
given  controller  invoked  at  point  A\  Thus,  it  cannot  be 
negative  or  greater  than  one,  and  based  on  Lemma  5,  we 
have 

T  s\f]  =  J  p^X'\X)f(X')dX' 

S 

<  J p^(X'\X)dX'  =  Fi(S\X,n)<  1 

S 

(123) 

Therefore,  based  on  the  definition  of  operator  norm,  we 
have 

WTsWop  =  suplllLstnil  :  V/  e  79, \f\\  <  1}  <  1  (124) 

/(•) 

According  to  Assumption  1 ,  there  exists  a  finite  number  N 
such  that 

infP„(<S|A\/z)=£  >  0  Wn  >  N  (125) 
?c 

where  ‘inf’  and  ‘sup’  denote  the  infimum  and  supremum, 
respectively.  Thus,  we  have 

||P„(  S\X,  p)  II  =  sup(  1  -  P„(  S\X,  p)) 

X 

=  1  -inf¥n(S\X,fi) 

=  1  -  p  <  1  Wn  >  N  (126) 
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Let  us  denote  the  nih  iterated  kernel  of  as 
pn(  X'\X,  p).  Since  this  iterated  kernel  is  a  pdf,  we  have 
pn(  X'\X,  pi)  >  0,  VX,  VX',  Vn.  We  can  write 

IIT %[f\  ||  =  ||  f  pN{X'\X,pOf{X’)dX'\\ 
s 

<  II  J Pn( X'\X,n)dX'\\  =  \\Fn(S\X,v)  II  <  «  <  1 
s 

(127) 


G.  Proof  of  Proposition  1 

We  first  state  and  prove  the  following  lemma  on  the  con¬ 
tinuity  of  the  transition  probability  in  the  local  controller’s 
parameter. 

Lemma  6.  Given  Assumption  2,  there  exists  a  <  oc  such 

that 

\\p(X'\X,n(b;\))- p(X'\X,ii(b;x))  II  <  c2||v-v|| 

(129) 


where  a  =  1  —  /3,  and  similar  to  Equation  (124),  we  get 
\\T%Wop  <  oc  <  1.  From  the  operator  norm  properties,  we 
have 

l|T^+X<  l|TgM|T5lU,<«<  1 

and  similarly  for  all  n  >  A,  we  have 

l|TSII^<a<l  Wn>N 

Now,  consider  the  series:  HTJIIop.  We  can  split  the 

sum  into  smaller  pieces  as  follows: 

oo  N  oo  (i+l)N 

Era^Era^  +  E  E  ra* 

n—  1  n=  1  i=l  n—iN+ 1 

But  because  ||T^+1  \\op  <  ||T^||0/?  for  all  n  >  N,  we  have 

(i+m 

E  imiu><*iiTgx 

n—iN- 1-1 

Also,  we  know 

l|T^||op<  ||T"||‘p<a'' 

and  thus,  we  have 

oo  N  oo  (z'+ 1  )N 

Era^Era^+E  E  ii Ts\\oP 

n=  1  n—  1  i'=l  n=iN+ 1 

<N 

OO  A7 

X — '  r  A 

<  IV  +  >  Na  =  N  H - =  c  <  oo 

^  1  -  a 

1=1 

□ 


Proof.  The  result  directly  follows  by  combining  the  two 
parts  of  Assumption  2.  □ 

Now  we  are  ready  to  prove  Proposition  1. 

Proof  To  show  F(B\X,  p)  is  continuous  in  v,  we  perturb 
v  to  some  v,  such  that  ||v  —  v||  <  r.  The  local  controller 
associated  with  node  v  is  referred  to  as  p,  whose  success¬ 
ful  absorption  region  is  denoted  by  B  and  whose  stopping 
region  is  S.  Similarly,  the  corresponding  transient  operator 
and  recurrent  function  are  referred  to  as  and  R  respec¬ 
tively.  Finally,  the  success  probability  associated  with  the 
perturbed  node  v  is  F(B\X,  p).  To  shorten  the  statements, 
we  refer  to  F(B\X,  p)  and  F(B\X,  p)  respectively  as  ^3(  X) 
and  X).  As  a  result  of  node  perturbation,  the  success 
probability  is  perturbed  as 

P(  B\ X,  p)-  P(  B\ X,  p):= %  - $  =  R+Ts PP]  —  R ~ T ^  [$] 

=  r  -rx ts  m  -ts  m  +t5  m  -t  ^  m  +t<s  m  -t  $  m 

=  (R-R)  +T<s  [qj  -  q?]+(  t5  -t  <,)  pp]+(  T^  -t  s)  m 
where 


T s  !/(•)  ]  ( X)  :=  J p>‘(  X'l X)f(  X')  dX'  (130) 
s 

Let  us  define  the  operators  TA<s  •=  ( —  T^)  and  AT ^  := 
( —  T^).  Now,  based  on  Corollary  1,  we  can  write 


-$  =  (/- Is)-1 


J?-K  +  TA5[<p]  + AT^PP] 


(131) 


F.  Proof  of  Corollary  1 

Proof  We  know  \\R\\  <  1,  and  thus  we  can  write 

oo  oo  oo 

II  ETM  <  E  IITsMWI  <  E  IITsIlop  <  C  <  oo 

n= 0  n= 0  n= 0 

Thus,  series  TJfR]  is  a  convergent  series  and  we  can 
define  the  operator  (I  —  Ts)~l  [R]  =  •  We  have 

oo 

11(7  -  T5)-1  ||op  =  ||  ET5lU>  <  c  <  OO  (128) 

n= 0 

□ 


and  thus  the  following  inequality  holds  on  the  supremum 
norm  of  the  perturbation  of  the  absorption  probability: 

113* -VII 

<  ||(/-T5r1LP(l|/?-^ll  +  l|TA5[0]||  +  ||ATi[$]||) 

<  c(||tf-tf||  +  ||Ta5[$]||  +  ||AT5[<p]||) 

=  c(\\Kl(X)  ||  +  || K2(X)  ||  +  \\K3(X)  II)  (132) 

where  Kx{X)  :=  R(  X)  -R(  X),  K2(  X)  :=  TASPP(-) ]( X), 
and  K^{X)\=  A  [*$(•)  ]( X).  In  the  following  we  bound 
K\,  K2,  and  A3,  and  thus  bound  ||qj  —  Vll,  accordingly. 
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G.l.  Bound for  KfX) 

The  supremum  norm  of  K\(  X)  is 

\\K\(X)  ||  =  \\R(X)-R(X)\\ 

=  II  J p^(X'\X)  dX'  -  J p^(X'\X)  dX'\\ 

B  B 

=  11  f  \p»(X’\X)-p\X'\X)]dX’ 

BnB 

+  J  p^(X’\X)  dX'  -  J  p^(X'\X)  dX'W 

B-B  B-B 

<  j  \\p^X'\X)-p\X'\X)\\dX' 

BnB 

+  11  J  p^(X'\X)dX'+  J  p*(X'\X)  dX'W 

B-B  B-B 

from  (129)  C 

<  j  c2\\v-v\\dX' +  Wx(BeB\X,pL)\\ 

BnB 

+  ||P1(B©B|A\/x)|| 

from  (50) 

<  c'||v-v|l+2c'||v-v||  =nl|v-v||  (133) 


G.3.  Bound  for  K3(X) 

We  have 

II K3(X)  II  =  ||ATi[$]||  =  ||Tj[<p]  -  Tsm\ 

=  11  J /+(<+' \X)<$(X')  dX'-Jp^iX' \X)<fo(X')  dX'\\ 
s  s 

=  11  J  (p^X'\X)-p^X'\X))^(Xf  dX’W 

S 

<  j  Wp^X'\X)-p\X'\X)  m(X')\\dX' 

S 

from  (129)  C 

<  J  c2\\y-  v||  dX'  =  y3||v-  v||  (135) 

s 

where  y^  <  oo. 

Therefore,  based  on  Equations  (132)— (135),  we  can  con¬ 
clude  that 

l|P(  B\X,  /x)  -P(  B\X,  ft)  ||  <  Y  lly  -  vll  (136) 

where  y  =  c(  y\  +  y2  +  Y3)  <  00,  which  completes  the 
proof  that  the  absorption  probability  under  the  controller  /x 
is  continuous  in  the  PRM  node  v.  □ 


where  c'2  <  00  and  y\  =  c'2  +  2d  <  00.  In  the  penulti¬ 
mate  inequality,  we  also  used  the  fact  that  Pi(  B— B\X,jl)< 
P i(B  ©  B\X,ji)  and  P fB  -  B|A\/x)  <  P fB  ©  B\X,pv) 
because  B  —  B  c  B  ©  B  and  B  -  B  c  B  ©  B. 


G.2.  Bound  for  K2(X) 

We  have 


\\k2(x)  11  =  \\TASm\  =  iit5[$]  -TSm\ 

=  11  J /+(+' | X)$(X')  dX'  -  J p^(X' \X)¥(X')  dX'W 

=  ||  J p*(X' \X)%(X')  dX'  -  J p^(X' \X)ty(X')  dX'\\ 
S-S 

<lf, 


S-S  S-S 

pli(X'\X)<$(X')  dX'  +j p^(X' \X)¥(X')  dX'W 


S-S 


S-S 


/v  from  (122)  f 

p^x’ \X)WX’)  dx' II  <  II J  pp(x'\x)  dX'W 

SqS  SqS 

=  wp^se^x.n)  11  <  wFdBeftx.n) n 

v  from  (50) 

=  ||Pi(£©£|T\/x)||  <  K2l|v-v||  (134) 


where  y2  =  c'  <  00.  The  penultimate  inequality  and 
equality  follow  from  the  relations  S  ©  S'  c  B  ©  B'  and 
B  ©  B'  =  B  ©  B',  respectively. 


H.  Proof  of  Theorem  1 

Before  starting  the  proof  of  Theorem  1,  we  state  the  follow¬ 
ing  proposition  that  concludes  the  continuity  of  the  success 
probability  of  tv  (overall  planner)  given  the  continuity  of  the 
success  probability  of  pdj  (the  individual  local  planners). 

Proposition  2.  (Continuity  of  success  probability  of  tv)  The 
success  probability  P(  success \bo,  tv)  is  continuous  in  V  if 
the  absorption  probabilities  F(&\b,  /x77)  are  continuous  in 
v7  for  all  i,  j ,  and  b. 

Proof.  Given  that  F(B^\b,  fi^)  is  continuous  in  v7,  for  all 
ij ,  we  want  to  show  that  P(  success \n,  bo)  is  continuous 
in  all  v7.  First,  let  us  look  at  the  structure  of  the  success 
probability: 

P(  success|Z?0>  tt)  =  F(B(  Uo)  \bo,  Uo)  P(  success|#(  /x0) ,  tv8) 

(137) 

where  /xo  is  computed  using  Equation  (34).  The  term 
P(B(/x0)  \b0,  Uo)  on  the  right-hand  side  of  Equation  (137) 
is  continuous  because  the  continuity  of  F(W\b,  /x77)  for  all 
ij  is  assumed  in  this  proposition.  Thus,  we  only  need  to 
show  the  continuity  of  the  second  term  in  Equation  (137). 
Without  loss  of  generality  we  can  consider  Bl  =  B(/io)- 
Then,  we  need  to  show  that  P(  success  | B\  tv8)  is  continuous 
in  v*  for  all  i. 

As  we  saw  in  Section  6.7,  the  probability  of  success  from 
the  ith  FIRM  node  is  as  follows: 

P(success|fi',jrs)=  rj(l  -  Q)_1  Kg  (138) 
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Moreover,  we  can  consider  Bgoal  =  BN  without  loss  of  gen¬ 
erality;  then,  the  ( i,j)th  element  of  matrix  Q  is  Q[i,j]  = 
F(Bl\Bi,7T8(Bi)),  and  the  7th  element  of  vector  7 Zg  is 
Kg\j]  =  P( BN \& ,  jt8( &) ). 

Since  we  considered  Bf  as  the  stopping  region  of  the  local 
controller  pff  we  have 

if/ ^7  (139) 

Therefore,  all  non-zero  elements  in  the  matrices  and 
Q  are  of  the  form  F(ff\Bl,  pfi).  Thus,  given  the  continu¬ 
ity  of  F(&\b,  the  transition  probability  F(B^\B\  fiij) 
is  continuous  and  the  matrices  7 Zg  and  Q  are  continuous. 
Therefore,  F(  success  tv8)  and  thus  F(  success \b$,  it)  are 
continuous  in  underlying  PRM  nodes.  □ 

Now  we  are  ready  to  prove  Theorem  1 . 

Proof.  Based  on  the  definition  of  probabilistic  complete¬ 
ness  under  uncertainty,  if  there  exists  a  successful  policy 
ft,  FIRM  has  to  find  a  successful  policy  it  as  the  num¬ 
ber  of  FIRM  nodes  increases  unboundedly.  Thus,  we  start 
by  assuming  that  there  exists  a  successful  policy  i  G  II 
for  a  given  initial  belief  bo.  Since  each  policy  in  II  is 
parametrized  by  a  PRM  graph,  there  exists  a  PRM  with 
nodes  V  =  {v1}^  that  parametrizes  the  policy  ft.  Since 
ft  is  a  successful  policy,  we  know  F(  success \bo,  ft)  >  pmin. 
Thus,  we  can  define  g*  =  F(  success \bo,  ft)  —pmin  >  0. 


Given  Assumptions  1,  2,  and  3,  and  based  on 
Propositions  1  and  2,  we  know  that  F(  success \bo,it) 
is  continuous  with  respect  to  the  parameters  of  the 
local  planners.  In  other  words,  for  any  g  >  0,  there 
exists  a  8  >  0  such  that  if  ||V  —  V||  <  8 ,  then 
|F(  success|Z?o,  tt(  •;  V))  — F(  success|Z?o,  tt(  •;  V))  |  <  g.  The 
notation  ||  V  —  V||  <  8  means  that  ||v*  —  yl||  <  8 ,  for  all  i,  or 
equivalently,  v'  e  for  all  i,  where  Q /  is  a  ball  with  radius 
8 ,  centred  at  v*. 

Therefore,  for  the  introduced  g*,  there  exists  a  5*  and 
corresponding  regions  {£2/}^  such  that  if  we  have  a  PRM 
whose  nodes  (or  a  subset  of  whose  nodes:  a  subset  of 
nodes  is  sufficient,  because  the  success  probability  is  a  non¬ 
decreasing  function  in  terms  of  the  number  of  nodes)  satisfy 
the  condition  v*  e  Qi  for  all  i  =  1, . . . ,  N,  then  the  plan¬ 
ner  Tt  parametrized  by  this  PRM  has  a  success  probability 
greater  than pmin ,  that  is,  F(  success \bo,  7r(  •;  V))  >  pmin,  and 
hence  n  is  successful. 

Since  8  >  0,  the  regions  £2,  have  non-empty  interiors. 
Consider  a  PRM  with  a  sampling  algorithm  under  which 
there  is  a  non-zero  probability  of  sampling  in  Qi9  such 
as  uniform  sampling.  Then,  starting  with  any  PRM,  if  we 
increase  the  number  of  nodes,  a  PRM  node  will  eventually 
be  chosen  at  every  with  probability  one.  Therefore  the 
policy  constructed  based  on  these  nodes  will  have  a  success 
probability  greater  than  pmin\  in  other  words,  we  eventu¬ 
ally  get  a  successful  policy  if  one  exists.  Thus,  FIRM  is 
probabilistically  complete  under  uncertainty.  □ 
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Abstract 

This  paper  presents  a  strategy  for  stochastic  control  of  small  aerial  vehicles  under  uncertainty  using  graph-based  methods. 
In  planning  with  graph-based  methods,  such  as  the  Probabilistic  Roadmap  Method  (PRM)  in  state  space  or  the  Information 
RoadMaps  (IRM)  in  information- state  (belief)  space,  the  local  planners  (along  the  edges)  are  responsible  to  drive  the  state/belief 
to  the  final  node  of  the  edge.  However,  for  aerial  vehicles  with  minimum  velocity  constraints,  driving  the  system  belief  to  a 
sampled  belief  is  a  challenge.  In  this  paper,  we  propose  a  novel  method  based  on  periodic  controllers,  in  which  instead  of 
stabilizing  the  belief  to  a  predefined  probability  distribution,  the  belief  is  stabilized  to  an  orbit  (periodic  path)  of  probability 
distributions.  Choosing  nodes  along  these  orbits,  the  node  reachability  in  belief  space  is  achieved  and  we  can  form  a  graph  in 
belief  space  that  can  handle  higher-order-dynamics  or  non-stoppable  systems  (whose  velocity  cannot  be  zero),  such  as  fixed- 
wing  aircraft.  The  proposed  method  takes  obstacles  into  account  and  provides  a  query-independent  graph,  since  its  edge  costs 
are  independent  of  each  other.  Thus,  it  satisfies  the  principle  of  optimality.  Therefore,  dynamic  programming  can  be  utilized 
to  compute  the  best  feedback  on  the  graph.  We  demonstrate  the  method’s  performance  on  a  unicycle  robot  and  a  six  degrees 
of  freedom  small  aerial  vehicle. 


I.  Introduction 

This  paper  is  concerned  with  the  stochastic  control  problem  for  two  classes  of  systems:  (i)  systems  with  dynamics, 
i.e.,  systems  whose  state  is  composed  of  position  and  its  higher  order  derivatives  such  as  velocity  and  acceleration,  and  (ii) 
systems  with  kinodynamical  constraints,  in  particular,  systems,  whose  velocity  cannot  fall  below  a  certain  threshold  (referred 
to  as  non-stoppable  systems  in  this  paper).  For  example,  consider  a  control  problem  where  the  system  state  is  composed  of 
the  position  and  velocity  (x,  x )  of  an  object.  Stabilizing  this  system  to  a  state  (x  =  a,  x  =  b)  where  b  0  is  not  possible 
since  to  stabilize  the  x  part  to  a,  the  x  must  go  to  zero.  As  an  example  for  non-stoppable  systems,  consider  a  system  whose 
state  only  consists  of  position  x ,  but  it  has  constraints  on  its  velocity  x  >  b  >  0.  All  fixed-wing  aircraft  fall  into  this  category 
as  their  velocity  cannot  fall  below  some  threshold  to  maintain  the  lift  requirement.  Thus,  stabilizing  such  systems  to  a  fixed 
state  is  a  challenge.  This  challenge  gets  even  more  difficult  when  this  stabilization  has  to  be  achieved  under  uncertainty. 
In  this  paper,  we  propose  a  framework  that  circumvents  the  need  for  point  stabilization  in  graph-based  (roadmap-based) 
methods  by  means  of  stabilization  to  suitably  designed  periodic  maneuvers. 

Motion  planning  under  uncertainty  (MPUU)  is  an  instance  of  the  problem  of  sequential  decision  making  under  uncertainty. 
Considering  the  uncertainty  in  an  object’s  motion,  the  problem  can  be  framed  as  a  stochastic  control  with  perfect  state 
information.  In  the  presence  of  uncertainty  in  sensory  readings,  i.e.,  measurement  noise,  the  state  of  the  system  is  no  longer 
available  for  decision  making.  In  such  a  situation,  a  state  estimation  module  can  provide  a  probability  distribution  (referred 
to  as  information-state  or  belief)  over  all  possible  states  of  the  system,  and  therefore  decision  making  has  to  be  performed 
in  belief  space.  Planning  in  belief  space  in  its  most  general  form  is  formulated  as  a  Partially  Observable  Markov  Decision 
Process  (POMDP)  problem  [8],  [18].  However,  in  general  solving  POMDPs  in  continuous  state,  control,  and  observation 
spaces,  where  many  robotic  problems  reside,  is  a  formidable  challenge. 

Sampling-based  motion  planning  methods  have  shown  great  success  in  dealing  with  many  deterministic  motion  planning 
problems  in  complex  environments  and  are  divided  into  two  main  classes:  (i)  roadmap-based  (graph-based)  methods  such 
as  the  Probabilistic  Roadmap  Method  (PRM)  and  its  variants  [6],  [19],  [20]  and  (ii)  tree-based  methods  such  as  methods  in 
[16],  [19],  [22].  In  deterministic  settings,  tree-based  methods  are  usually  single-query,  i.e.,  their  solution  is  valid  for  a  given 
initial  point  whereas  roadmap-based  methods  are  mainly  multi-query,  i.e.,  the  generated  roadmap  structure  is  independent 
of  the  initial  point.  In  this  sense,  roadmap-based  methods  are  a  suitable  choice  for  extension  to  belief  space  because  the 
solution  of  a  POMDP  is  feedback  over  the  entire  belief  space  and  it  does  not  depend  on  the  initial  belief.  Accordingly, 
restricting  the  attention  to  a  representative  graph  (roadmap)  in  the  space,  the  feedback  can  be  defined  as  a  mapping  from 
its  nodes  to  its  edges. 

Similar  to  motion  planning  in  state  space,  in  belief  space  motion  planning,  the  basic  motion  tasks  can  be  defined  as 
point-to-point  motion ,  which  deals  with  driving  the  belief  of  the  moving  object  from  a  given  belief  to  another  given  belief, 
and  trajectory  following ,  which  deals  with  following  a  trajectory  in  belief  space.  Depending  on  the  kinematics/dynamics  of 
the  system,  these  tasks  might  be  very  challenging  in  the  state  space.  However,  they  often  are  more  challenging  in  belief  space 
even  for  simple  kinematics/dynamics.  To  construct  a  query-independent  roadmap  in  state/belief  space,  point-to-point  motion 
in  state/belief  space  is  required.  Feedback-based  Information  RoadMap  (FIRM)  [3],  [5]  extends  graph-based  methods  to 
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belief  space  by  embedding  the  point-to-point  motion  behavior  in  belief  space  using  belief  stabilizers  (i.e.  stationary  feedback 
controllers),  which  was  a  missing  behavior  in  pioneering  works  such  as  [17],  [24],  [28]. 

As  a  result  of  embedding  the  point-to-point  motion  behavior  in  belief  space,  FIRM  generates  a  graph  in  belief  space  that 
is  query  independent  and  only  needs  to  be  constructed  once  offline.  Establishing  a  connection  between  its  solution  and  the 
original  POMDP  [5],  it  is  shown  that  FIRM  is  probabilistically  complete  [4].  In  [5]  first  FIRM  is  presented  as  an  abstract 
framework  for  graph-based  planning  in  belief  space  and  then  Stationary  Linear  Quadratic  Gaussian-FIRM  (SLQG-FIRM)  is 
presented  as  a  concrete  instantiation  of  the  abstract  FIRM  framework.  The  performance  of  FIRM  has  been  demonstrated  on 
physical  mobile  robots  in  changing  environments  [2].  However,  SLQG-FIRM  is  limited  to  the  systems  that  are  stabilizable 
to  stationary  fixed  points  (with  zero  velocity)  in  the  state  space.  This  excludes  the  class  of  systems  we  consider  in  this  paper. 

The  main  contributions  of  this  paper  are: 

•  Proposing  a  graph-based  solution  for  controlling  small  aerial  vehicles  in  the  presence  of  uncertainty  and  constraints. 
We  accomplish  this  goal  by  proposing  a  concrete  instantiation  of  the  FIRM  framework  that  can  handle  non-stoppable 
systems  (i.e.,  class  of  dynamical  systems  that  are  not  stabilizable  to  a  point  with  zero- velocity),  such  as  fixed-wing 
aircraft. 

•  Accordingly,  transforming  the  intractable  constrained  POMDP  to  a  tractable  dynamic  programming  over  a  graph 
corresponding  to  non-stoppable  systems. 

•  Designing  the  periodic-node  PRM  in  state  space. 

•  Investigating  the  cyclo stationary  behavior  of  the  belief  under  Periodic  Linear  Quadratic  Gaussian  (PLQG)  controllers 
and  designing  a  belief  stabilizer  for  non-stoppable  systems. 

This  paper  is  organized  as  follows.  We  start  by  introducing  the  concept  of  periodic-node  graph  in  state  space,  whose  nodes 
lie  on  periodic  trajectories  referred  to  as  orbit.  In  Section  III,  we  review  the  problem  of  stochastic  optimal  control  with 
imperfect  observations.  Section  IV  constructs  the  abstract  FIRM  framework  based  on  the  underlying  periodic-node  graph. 
In  this  section,  we  show  how  constraints  are  incorporated  in  the  construction  phase  of  the  planner.  Then,  in  Section  V  we 
analyze  the  behavior  of  PLQG  controllers  as  belief  stabilizers  and  accordingly  we  propose  an  approach  to  characterize  and 
select  the  reachable  regions  in  belief  space  under  PLQG  controllers.  As  a  result  we  extend  the  periodic-node  PRM  from 
state  space  to  a  corresponding  graph  in  belief  space.  We  provide  algorithms  for  offline  construction  of  this  graph  and  online 
(re)planning  with  this  graph.  Finally,  in  Section  VI,  we  demonstrate  the  performance  of  the  proposed  method  on  a  planar 
unicycle  model  with  minimum  allowable  velocity  and  on  a  simplified  6DoF  aerial  vehicle  model. 

II.  Periodic-Node  PRM 

An  implicit  assumption  in  graph-based  methods  such  as  PRM  [20]  is  that  on  every  edge  there  exists  a  controller  to  drive 
the  robot  from  the  start  node  of  the  edge  to  the  end  node  of  the  edge  or  to  an  e-neighborhood  of  the  end  node,  for  a 
sufficiently  small  e  >  0.  For  a  linearly  controllable  robot,  a  linear  controller  can  locally  track  a  PRM  edge  and  drive  the 
robot  to  its  endpoint  node.  Obviously,  controlling  non-stoppable  robots  on  a  PRM  roadmap  is  a  challenge,  since  they  have 
constraints  on  their  controls  and  cannot  reduce  their  velocity  below  a  specific  threshold  umin,  and  hence,  stabilization  is  not 
feasible  for  them.  This  task  becomes  more  challenging  if  the  system  is  also  nonholonomic.  In  a  nonholonomic  robot  such 
as  a  unicycle,  the  linearized  model  at  any  point  is  not  controllable,  and  hence,  a  linear  controller  cannot  stabilize  the  robot 
to  the  PRM  nodes.  Consider  the  discrete  unicycle  model: 

(Xfc  4  (Vk  +  nv)St  cos  0k  \ 

Y/c  +  (Vk  +  nv)5t  sin  0k  ,  wk  ~  Af( 0,  Q*)  (1) 

Ok  +  (uk  +  nu)5t  J 

where  xk  =  (x/~,  Y/c,  0k)T  describes  the  robot  state,  in  which  (xk,yk)T  is  the  2D  position  of  the  robot  and  0k  is  the  heading 
angle  of  the  robot,  at  time  step  k.  The  vector  uk  =  (Vk,ujk)T  is  the  control  vector  consisting  of  linear  velocity  Vk  and 
angular  velocity  uk.  The  motion  noise  vector  is  denoted  by  wk  =  (nv,  nw)T.  Linearizing  this  system  about  the  point  (node) 
v  =  (xp,y P,0P),  nominal  control  up  =  (Vp,ujp),  and  zero  noise,  we  get: 

Xk+1  =  f(xk,uk,wk)  «  /(v,  up ,  0)  +  A(xk  -  v)  +  B (uk  -  up )  +  Gwk  (2) 

where  A  =  ^(v,^p,  0),  B  =  ^£(v,i/p,  0),  G  =  0).  Checking  the  rank  of  the  controllability  matrix  of  the 

linearized  system,  we  get:  rank([ B,  AB,  A2B])  =  2  +  I(Vp  >  0),  where  I  is  the  indicator  function,  which  is  one  if  Vp  >  0 
and  is  zero,  otherwise.  Therefore,  if  the  nominal  control  is  zero,  i.e.,  up  =  ( Vp,ujp)t  =  (0,  0)T,  which  is  the  case  when  we 
stabilize  the  robot  to  a  PRM  node,  the  resulting  linear  system  is  not  controllable,  since  rank([B,  AB,  A2B])  =  2  <  3.  Thus, 
a  linear  controller  cannot  stabilize  the  unicycle  to  a  PRM  node.  Moreover,  based  on  the  necessary  condition  in  Brockett’s 
paper  [12],  even  a  smooth  time-invariant  nonlinear  control  law  cannot  drive  the  unicycle  to  a  PRM  node,  and  the  stabilizing 
controller  has  to  be  either  discontinuous  and/or  time- varying. 

On  roadmaps  in  belief  space,  the  situation  is  even  more  complicated,  since  the  controller  has  to  drive  the  probability 
distribution  over  the  state  to  the  e-neighborhood  of  a  belief  node  in  belief  space.  Again,  if  the  linearized  system  in  (2) 
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is  controllable,  using  a  linear  stochastic  controller  such  as  the  stationary  LQG  controller,  one  can  drive  the  robot  belief 
to  the  belief  node  [5].  However,  if  the  system  is  non-stoppable  and/or  its  linearized  model  is  not  controllable,  the  belief 
stabilization,  if  possible,  is  much  more  difficult  than  state  stabilization. 

A.  Periodic-node  PRM 

In  this  paper,  we  circumvent  the  problem  of  stabilization  to  graph  nodes  by  designing  a  variant  of  PRM,  referred  to  as 
Periodic-Node  PRM  (PNPRM).  Although  there  are  different  ways  to  address  this  problem  in  state  space,  the  critical  property 
of  PNPRM  is  that  it  can  be  extended  to  belief  space  to  form  a  graph  whose  nodes  are  beliefs  that  are  reachable  without  a 
point- stabilization  process.  Let  us  denote  the  motion  model  with  Xk+i  =  f(xk^Uk^Wk),  where  state,  control,  and  process 
noise  at  the  k- th  time  step  are  denoted  by  Xk,  Uk,  and  Wk,  respectively. 

Similar  to  traditional  PRM,  PNPRM  also  consists  of  nodes  and  edges.  However,  in  PNPRM,  the  nodes  lie  on  small 
T-periodic  trajectories  (trajectories  with  period  T)  in  the  state  space,  referred  to  as  orbits.  Each  orbit  satisfies  the  control 
constraints  and  non-holonomic  constraints  of  the  moving  robot.  To  construct  a  PNPRM,  we  first  sample  a  set  of  orbits  in 
the  state  space,  and  then  on  each  orbit,  a  number  of  state  nodes  are  selected.  Let  us  denote  the  j-th  orbit  trajectory  by 
O-7  :=  c>o,  where  xk+1  =  f(xp\up\  0),  xk+T  =  xpk ,  and  upk+T  =  upk  .  The  set  of  PNPRM  nodes  that  are 

chosen  on  O-7  is  denoted  by  V-7  =  {vj,  v^,  •  •  •  ,  vJm}  where  vJa  =  xpk  for  some  ka  E  {1,  •  •  •  ,  T}.  Edges  in  PNPRM  do  not 
connect  nodes  to  nodes,  but  they  connect  orbits  to  orbits  in  a  way  that  respects  all  the  control  constraints  and  nonholonomic 
constraints.  Thus,  the  (i,  j)-th  edge  denoted  by  e2-7  connects  Ol  to  OK 

As  a  result,  a  node  vla  is  connected  to  the  node  through  concatenation  of  three  path  segments:  i)  the  first  segment  is 
a  part  of  Ol  that  connects  to  the  starting  point  of  elK  This  part  is  called  pre-edge  and  is  denoted  by  elaK  ii)  the  second 
segment  is  the  edge  e2-7  itself  that  connects  Ol  to  OK  and  Hi)  the  third  segment  is  a  part  of  O-7  that  connects  the  ending 
point  of  e2-7  to  v^.  This  part  is  called  post-edge  and  is  denoted  by  e1^ . 

One  form  of  constructing  orbits  is  based  on  circular  periodic  trajectories,  where  the  edges  are  the  lines  that  are  tangent 
to  the  orbits.  Figure  1  shows  a  simple  PNPRM  with  three  orbits  Ol ,  Or ,  and  OK  On  each  orbit  four  nodes  are  selected 
which  are  drawn  (dots)  with  different  colors.  Edges  e2-7  and  e7"-7  connect  the  corresponding  orbits. 


Fig.  1.  A  simple  PNPRM  with  three  orbits,  twelve  nodes,  and  two  edges. 


III.  Planning  in  Information  Space 

Partially  Observable  Markov  Decision  Processes  (POMDPs)  are  the  most  general  formulation  for  motion  planning  problems 
under  motion  and  sensing  uncertainties.  Note  that  in  this  paper,  the  environment  map  is  assumed  to  be  known.  The  solution  of 
the  POMDP  problem  is  an  optimal  feedback  law  (mapping)  i r,  which  maps  the  information  (belief)  space  to  the  control  space. 
Let  us  denote  the  state,  control  and  observation  at  time  step  k  by  Xk,  Uk,  and  Zk,  respectively,  which  belong  to  spaces  X,  U, 
and  Z,  respectively.  The  belief  in  stochastic  setting  is  defined  as  the  probability  distribution  function  (pdf)  of  the  system  state 
conditioned  on  the  obtained  measurements  and  applied  controls  up  to  the  k- th  time  step,  i.e.,  bk  :=  p(xk\zo-.k',  uo-.k-i)  and  B 
denotes  the  belief  space,  containing  all  possible  beliefs.  Note  that  zo:k  =  {zi,  Z2,  •  •  •  ,Zk}  and  uo:k-i  =  {ui,U2,  •  •  •  ,  Uk- 1}. 
It  is  well  known  that  the  POMDP  problem  can  be  posed  as  a  Markov  Decision  Process  (MDP)  in  belief  space  [9],  [27], 
whose  solution  i r  is  computed  by  solving  the  following  Dynamic  Programming  (DP)  equation: 

J(b)  =  min{c(6,  u)  +  [ p(b'\b,  u)  J(b')db'},  \/b  E  B  (3a) 

u  J  b 

7 r(b)  =  argmin{c(6,  u)  +  [ p{b'\b,  u)  J(b')db'},  V6  E  B  (3b) 

u  J  b 

where  J(-)  :  B  — >>  M  is  the  optimal  cost-to-go  function,  p(b'\b,u)  is  the  belief  transition  pdf  under  control  u,  and  c(b,u)  is 
the  one-step  cost  of  taking  control  u  at  belief  b. 
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IV.  FIRM  Framework  Based  on  PNPRM 

It  is  well  known  that  the  above  DP  equation  is  exceedingly  difficult  to  solve  since  it  is  defined  over  an  infinite-dimensional 
belief  space.  In  this  section,  inspired  by  sampling-based  methods,  we  build  a  graph  in  belief  space  by  sampling  beliefs  from 
belief  space  and  connecting  them  to  each  other.  Hence  we  reduce  the  intractable  DP  in  (3)  to  a  tractable  DP  over  this  graph. 

Graph  nodes:  Let  us  denote  the  nodes  and  edges  of  the  underlying  PNPRM  by  V  =  U^=1V-7  =  U™=1{vj,}™=1  and 
£  =  { eij },  respectively.  Corresponding  to  each  PNPRM  node  v^,  we  have  a  unique  belief  bJa  whose  reachability  can  be 
guaranteed  utilizing  appropriate  feedback  controllers.  A  concrete  example  of  designing  such  a  controller  and  computing 
will  be  provided  in  Section  V.  We  define  the  j- th  graph  node  in  belief  space  (or  FIRM  node)  B 9  as  a  neighborhood  around 
bJa ;  i.e.,  BJa  =  {b  :  \\b  —  bJa\\  <  e}.  The  set  of  FIRM  nodes  that  correspond  to  the  i-th  orbit  is  denoted  by  Yl  =  and 

the  set  of  all  FIRM  nodes  is  V  =  U-L1¥L  Figure  2  shows  an  example  set  of  Gaussian  bJa’s  corresponding  to  the  PNPRM 
nodes  in  Fig.  1.  In  Gaussian  case  each  belief  b  is  characterized  by  its  mean  x+  and  covariance  P,  denoted  by  b=(x+,P). 
In  Fig.  2,  the  mean  part  of  bJa’s  is  assumed  to  coincide  with  the  underlying  PNPRM  node  and  the  covariance  part  is  shown 
by  its  3cr  ellipse.  Also  FIRM  node  BJ2  (neighborhood  of  b2)  is  shown. 


Fig.  2.  =  is  the  center  of  belief  nodes  corresponding  to  the  nodes  shown  in  Fig.  1,  where  ’s  are  shown  by  their  3cr-ellipse.  As  an 

example  of  a  FIRM  node,  the  magnified  version  of  B2 ,  which  is  a  small  neighborhood  centered  at  b2 ,  is  shown  in  the  dotted  box,  where  the  blue  shaded 
region  depicts  the  covariance  neighborhood  and  green  shaded  region  depicts  the  mean  neighborhood. 


Graph  edges:  Each  graph  edge  in  belief  space  is  a  local  feedback  controller  /i(-)  :  B  -A  U.  The  role  of  (ia,j)- th  local 
controller,  denoted  by  /P,u,  is  to  take  the  belief  from  the  FIRM  node  Bla  to  a  FIRM  node  on  orbit  OB  i.e.,  to  U7P^.  Thus,  we 
define  Ta,ZJ  :=  min{&  >  0 G  U7i^|foo  =  b£,/xCMJ'}  G  [0,  oo]  as  the  stopping  time  of  the  controller  The  stopping 

time  is  a  random  variable  that  defines  the  time  it  takes  for  the  controller  to  drive  the  belief  from  the  initial  node  to  the  target 
orbit.  Also,  let  the  F(A\b,  fi)  be  the  probability  of  reaching  set  A  in  finite  time  under  the  local  controller  /i  starting  from  belief 
b.  Therefore,  for  a  local  controller  to  act  as  a  graph  edge,  it  has  to  satisfy  P(U7P^|6^, =  Pr(7~aP  <  oo)  =  1 
in  the  absence  of  obstacles.  In  other  words,  in  a  constraint-free  environment,  the  feedback  controller  /ia,2J  (-)  has  to  drive 
the  system’s  belief  from  Bla  into  a  B  G  VJ  in  finite  time  with  probability  one. 

In  this  section,  it  is  assumed  that  a  set  of  edges  (local  controllers)  that  satisfy  the  mentioned  reachability  property  is 
given.  In  Section  V  we  show  that  the  above  property  can  be  accomplished  using  periodic  LQG  controller  for  the  class  of 
non-stoppable/nonholonomic  systems,  such  as  small  aerial  vehicles.  Accordingly,  we  provide  concrete  algorithms  to  construct 
local  controllers  and  their  corresponding  reachable  nodes. 

Graph  in  belief  space:  Formally,  we  define  the  constructed  graph  as  G  =  (V,  M)  with  the  set  of  nodes  V  =  {BJa} 
and  the  set  of  edges  M  =  The  set  of  edges  available  (i.e.,  outgoing)  at  FIRM  node  Bla  is  denoted  by  M (i,a)  := 

£  M|zle^  G  £}.  It  is  worth  noting  that  the  planning  is  still  performed  over  continuous  state,  control,  and  observation 
spaces  and  we  do  not  discretize  any  of  those. 

Graph  transition  cost  and  probabilities:  We  generalize  the  one-step  transition  costs  c(b,  u )  and  probabilities  to  the  cost 
of  taking  a  controller  in  a  graph  node  and  its  corresponding  transition  probabilities  along  the  graph  edges: 

,  ij  q-ot ,  ij 

C9(Bia, na'^)  :=  Y,  c(h,Ha’ij(bk)\b0  =  i&)*Y  c(bk,na’ij(bk)\b0  =  b),  \/b  e  B'a  (4a) 

k=0  k=0 

P*(B*|St  •=  ~  nB%^ij),  Vb  G  Bla  (4b) 

The  “piecewise  constant  approximation”  in  (4)  is  an  arbitrarily  good  approximation  for  sufficiently  small  Bla  and  smooth 
cost  function  and  transition  probabilities. 

Graph  policy:  Graph  policy  tt9  :  V  -A  M  is  a  function  that  returns  a  local  controller  for  any  given  node  of  the  graph. 
We  denote  the  space  of  all  graph  policies  by  U9. 
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Graph  cost-to-go:  To  choose  the  best  graph  policy,  we  define  the  graph  cost-to-go  J9  from  every  graph  node.  Let  B ^ 
be  the  k- th  FIRM  node  visited  along  the  plan.  Then,  we  can  formally  define  the  cost-to-go  from  any  node  Bo  G  V  as: 

oo 

J»(B0;7r»)  =  £E[C»(%,7r»(%))] 

k= 0 

s-t.  (5) 

Accordingly,  the  MDP  defined  on  the  graph  is  as  follows: 

oo 

7TS*  =  argmin^E  [C9  (B^n9 

k= 0 

s±  (6) 

Obstacle-free  graph  DP:  Since  the  graph  MDP  is  defined  on  a  finite  number  of  FIRM  nodes,  we  can  form  a  tractable 
Dynamic  Programming  (DP)  to  find  the  optimal  graph  policy: 


J9(Bia)  =  min  C *•"( Bi .  )  J,  ^  ^  >?U l ^ !  K ,  ^a’ij),  Va,i,j 


7r®(Sje)=arg  min  C3(B^  /*“■«)  +  V  J* (B^l* (BZ, \B'a,  //*■«),  Va,i,j 

7=1 


(7a) 

(7b) 


where  Jp(-)  :=  min^g  J{-\'k9)  is  the  optimal  cost-to-go. 

Incorporating  obstacles  into  planning:  In  the  presence  of  obstacles,  we  cannot  assure  that  the  local  controller  /ia,^(-) 
can  drive  any  b  G  Bza  into  U 1BJ1  with  probability  one.  Instead,  we  specify  the  failure  probabilities  that  the  robot  collides 
with  an  obstacle.  Let  us  denote  the  failure  set  on  X  by  F  (i.e.,  F  =  X  —  X/ree).  Let  F(F\B'la,  /jba^)  :=  F(F\bla,  /xa,u) 
denote  the  probability  of  hitting  the  failure  set  under  local  controller  /ia,u  starting  from  Bla.  Similarly,  we  generalize  the 
cost-to-go  function  by  defining  J9(F )  as  a  user-defined  suitably  high  cost  for  hitting  obstacles.  Therefore,  we  can  modify 
(7)  to  incorporate  obstacles  in  the  state  space  as  follows: 


J9(Bia)  =  min  C9{B^^)  +  J9{F)F9{F\B^^) 

m 

+  YJJ9(Bi1)¥3{Bi1\Bia,^),  \/a,i,j 

7=1 

n9(Bl)  -arg  min  C9^,  +  J^(F)P^(F|^,  /Z^-7) 

rn 

+  YJJ9(Bi1)W3{Bi1\Bl^),  Va,  i,j 

7=1 


(8a) 


(8b) 


Thus,  all  that  is  required  to  solve  the  above  DP  equation  are  the  values  of  the  costs  C9 (Bla,  //*’*J)  and  transition  probability 
functions  F9(-\ which  are  discussed  in  Section  V. 

Overall  policy  7r:  The  overall  feedback  7 r  is  generated  by  combining  the  policy  tt9  on  the  graph  and  the  local  controllers 
rs.  However,  this  combination  leads  to  a  non-Markov  policy.  More  rigorously,  the  resulting  policy  is  a  semi-Markov 
policy  [26].  In  other  words,  the  current  action  depends  on  the  current  belief  as  well  as  the  last  visited  FIRM  node.  Thus, 
the  overall  feedback  tt  :  ¥  x  B  — »  U  can  be  written  as: 


n(B,b)  =  n9(B)(b)  =»(!>).  (9) 

Initial  controller:  Now,  let  us  consider  the  first  step  of  planning  where  the  system  has  not  visited  any  FIRM  node  yet. 
Given  the  initial  belief  is  bo,  if  bo  is  in  a  FIRM  node  B,  then  we  can  just  generate  the  control  signal  as  tt(B,  bo)  based 
on  Eq.  9.  However,  if  bo  does  not  belong  to  any  of  the  FIRM  nodes,  we  consider  a  singleton  FIRM  node  Bo  =  {bo}  and 
connect  it  to  the  graph.  Let  us  denote  the  set  of  newly  added  local  controllers  by  M(0).  Computing  the  transition  cost 
C(6o,/PJ),  and  probabilities  P(S*|60,  /A7),  and  P(F|&0,  /A7),  for  invoking  local  controllers  /A7  G  M(0)  at  bo,  we  choose 
the  best  initial  controller  /a®  as: 

rn 

n9(B0)  =  g2=arg  min  +  V  P9(^|B0,  **«)  J3(B^)  +  F9(F\B0,  ^)J9{F)}  (10) 

7=1 

Extending  tt9  to  take  Bo  into  account,  we  now  can  use  7r(Bo,bo)  to  generate  the  control  signal.  It  is  worth  noting  that 
computing  is  the  only  part  of  computation  that  depends  on  the  initial  belief  and  has  to  be  reproduced  for  every  query  with 
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a  new  initial  belief.  After  computing  /j,®  we  always  store  the  last  visited  FIRM  node  and  use  policy  i r  (computed  offline)  in 
Eq.  9  to  generate  control  signals  in  future  time  steps. 

V.  PLQG-based  FIRM  Construction 

In  this  section,  we  construct  a  concrete  instantiation  of  the  graph  described  in  the  previous  section.  We  utilize  PLQG 
controllers  to  design  graph  edges  and  reachable  FIRM  nodes  B ^  required  in  (8).  Then  we  discuss  how  the  transition 
probabilities  F9(-\Bla,  /ia,2J),  and  costs  C9(Blal  in  (8)  are  computed.  In  this  instantiation  we  restrict  the  belief  to 

Gaussian  distributions  and  we  start  by  defining  notation  needed  for  dealing  with  Gaussian  beliefs. 

Gaussian  belief  space:  Let  us  denote  the  estimation  vector  by  x+,  whose  distribution  is  bk  =  p(xk)  =  p(xk\zo:k).  Denote 
the  mean  and  covariance  of  x+  by  x+  =  E[x+]  and  P  =  E[(x+  —  x+)(x+  —  x+)T],  respectively.  Denoting  the  Gaussian 
belief  space  by  G®,  every  function  b(-)  G  G®,  can  be  characterized  by  a  mean-covariance  pair,  i.e.,  b  =  (T+,P).  Abusing 
notation,  we  also  show  this  using  “equality  relation”,  i.e.,  b  =  (x+,P). 

A.  Designing  PLQG-based  Graph  Nodes  {BJa} 

LQG  controllers:  A  Linear  Quadratic  Gaussian  (LQG)  controller  is  composed  of  a  Kalman  filter  as  the  state  estimator 
and  a  Linear  Quadratic  Regulator  (LQR)  as  the  separated  controller  [21].  Thus,  the  belief  dynamics  bk+\  =  r(bk,uk,  zk+ i) 
come  from  the  Kalman  filtering  equations,  and  the  controller  uk  =  p(bk)  that  acts  on  the  belief,  comes  from  the  LQR 
equations.  LQG  is  an  optimal  controller  for  linear  systems  with  Gaussian  noise  [9].  However,  it  is  most  often  used  for 
stabilizing  nonlinear  systems  to  a  given  trajectory  or  to  a  given  point. 

Periodic  LQG:  Periodic  LQG  (PLQG)  is  a  time- varying  LQG  that  is  designed  to  track  a  given  periodic  trajectory  [11], 
[13].  In  Appendix  I  we  review  the  periodic  LQG  controller  in  detail.  Here,  we  only  state  the  belief  reachability  result  under 
the  PLQG. 

System  model  and  quadratic  cost:  Consider  a  T-periodic  PNPRM  orbit  O  =  K  -  4)k>%  and  the  set  of  nodes  {va}  on  it. 
Let  us  denote  the  time- varying  linear  (linearized)  system  along  the  orbit  O  by  the  tuple  Tk  =  (A&,  B&,  G&,  Q&,  H&,  M&,  R&) 
that  represents  the  following  state  space  model,  where  Tk  =  Y ^+t  • 

Xk-\- 1 —  AkXk  H-  "F  ^k^k  •>  ^k  ^  N (0?  Qfc)  (11a) 

zk=  Hkxk  +  Mkvk,  vk~N(  0,Rfc).  (lib) 

Consider  a  PLQG  controller  that  is  designed  for  the  system  in  (11)  to  track  the  orbit  (xk,uk)k>  1  through  minimizing  the 
following  quadratic  cost: 

J  =  W  Xxk  +ulWuuk\,  (12) 

k>  0 

where  xk  =  xk  —  xvk  and  uk  =  uk  —  upk.  Matrices  Wx  and  Wn  are  positive  definite  weight  matrices  for  state  and  control 
cost,  respectively.  Let  us  also  define  matrices  Qk  and  Wx  such  that  GkQkGk  =  QkQk ,  Wx  =  W^WX,  for  all  k.  Now, 
consider  the  class  of  systems,  and  associated  PLQG  controllers  that  satisfy  the  following  property. 

Property  1:  The  pairs  (Akl~Bk)  and  (Ak,Qk)  are  controllable  pairs  [9],  and  the  pairs  (AklHk)  and  (Ak,Wx)  are 
observable  pairs  [9],  for  all  k  =  1,  •  •  •  ,  T. 

Belief  node  reachability  under  PLQG:  In  the  following,  we  present  three  lemmas,  through  which  we  can  construct  pairs 
of  periodic  LQG  controllers,  and  reachable  nodes  in  belief  space,  for  non-stoppable/nonholonomic  dynamical  systems. 

Lemma  1:  (Cyclostationary  behavior  of  belief  under  PLQG)  Consider  the  PLQG  controller  designed  for  the  system  in 
(11)  to  track  the  orbit  Given  Property  1  is  satisfied,  the  belief  process  bk  under  PLQG  converges  to  a  Gaussian 

cyclostationary  process  [10],  i.e.,  the  distribution  over  belief  converges  to  a  T-periodic  Gaussian  distribution,  where  we 
denote  the  mean  and  covariance  of  this  process  by  bk  and  Ck,  respectively: 

bk  ~  N{bk>  ck)  =  C/c+t),  (13) 

where  bk  =  (xk  ,  Pk )  and  bk  =  (xk,Pk).  The  covariance  matrices  Pk  is  characterized  in  Lemma  2  and  covariance  Ck  is 
characterized  in  Appendix  I  (Eq.  68). 

Proof:  See  Appendix  I.  ■ 

Lemma  2:  (Convergence  of  DPRE)  Given  Property  1,  the  following  Discrete  Periodic  Riccati  Equation  (DPRE)  has  a 
unique  Symmetric  T-Periodic  Positive  Semi-definite  (SPPS)  solution  [11],  denoted  by  Pk  : 

Pm  =  A* (A  -  AHfc  (Hfc-AHfc  +  MfeRfeM^)-1HfePfc-)A^  +  GfcQfeG^  (14) 

Moreover,  the  covariance  matrix  Pk  introduced  in  Lemma  1  is  computed  as 

Pk  =  N- N Hfe  (H* Pk 1  Ha-  Pk  (15) 

Proof:  See  [11].  ■ 
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Now,  we  state  the  main  result,  through  which  we  can  construct  the  proper  pairs  of  periodic  LQG  controller  and  reachable 
nodes  in  belief  space. 

Lemma  3:  (Belief  node  reachability  under  PLQG)  Consider  the  PLQG  controller  //  designed  for  the  system  in  (11) 
to  track  the  orbit  («)*>!•  Suppose  the  matrix  is  full  rank,  and  Property  1  is  satisfied.  Also,  consider  the  sets 
,  Bm  in  belief  space,  such  that  interior  of  Ba  contains  b k  for  some  ka  G  {1,  •  •  •  ,  T}.  Then,  under  //,  the  region 
U aBa  is  reachable  in  finite  time  with  probability  one. 

Proof:  The  intuitive  idea  behind  the  proof  is:  if  we  define  a  region  centered  at  the  mean  value  of  a  Gaussian  distribution, 
and  if  we  sample  from  this  distribution,  in  a  finite  number  of  samples  we  will  end  up  with  a  sample  in  the  given  region. 
The  rigorous  proof  is  detailed  in  Appendix  II.  ■ 

FIRM  nodes:  As  mentioned,  to  construct  a  graph  in  belief  space  we  first  construct  its  underlying  PNPRM,  characterized 
by  the  triple  {{O-7},  {v^},  {e^}}.  Linearizing  the  system  along  the  j- th  orbit  &  =  k  a  h>0  results  in  a  time- varying 
T-periodic  system  T[  =  (A'[. G|;,  H|:, M'[, Rjj: 

xk+i=  Ajkxk +Bjkuk  +  Gjkwk,  wk  ~  V(0,  Q,{)  (16a) 

zk=  BJkxk  +  MJkvk,  vk  ~  Af( 0,  R4),  (16b) 

where  wj^  and  Vk  are  motion  and  measurement  noises,  respectively,  drawn  from  zero-mean  Gaussian  distributions  with 
covariances  and  R^.  Since  the  system  in  (16)  is  T-periodic  (i.e.,Y^  =  X^+T),  we  can  design  a  corresponding  PLQG 
controller  pk.  The  controller  pk  is  referred  to  as  the  j-th  node -controller.  Since  the  orbits  are  designed  such  that  Property 
1  is  satisfied  on  them,  based  on  Lemma  1  the  belief  converges  to  a  Gaussian  cyclo stationary  process.  The  mean  of  this 
cyclostationary  process  is  denoted  by  bk  and  is  characterized  in  Lemma  2,  where  its  existence  and  uniqueness  are  guaranteed. 

Corresponding  to  the  PNPRM  node  on  orbit  we  choose  the  belief  nodes  B°a  as  an  e-neighborhood  of  bJa  :=  bk  = 

(vLHa>-  (SeeFig.2.)  . ' 

Bi  =  {b=(x,P):  ||x-v£||  <61,\\P~pijm<52},  (17) 

where  ||  •  ||  and  ||  •  ||m  denote  suitable  vector  and  matrix  norms,  respectively.  The  size  of  FIRM  nodes  are  determined  by  Si 
and  62 •  Based  on  Lemma  3,  GaBJa  is  a  reachable  region  under  node-controller  pk.  Note  that  Si  and  S2  need  to  be  sufficiently 
small  to  satisfy  the  approximation  in  (4). 

B.  PLQG-based  Graph  Edges  {pP^} 

The  role  of  the  local  controller  pp^  is  to  drive  the  belief  from  the  node  Bla  to  U7RT  i.e.,  to  a  node  B^  on  the  j-th 
orbit.  To  construct  the  local  controller  pP^f  we  precede  the  node-controller  pk,  with  a  time- varying  LQG  controller  pk,lJ , 
which  is  called  the  edge -controller  here. 

Edge-controller:  Consider  a  finite  trajectory  that  consists  of  three  segments:  i)  the  pre-edge  eloi J  as  defined  in  Section  II, 
ii)  the  edge  itself  e2J  ,  and  iii)  a  part  of  O-7  that  connects  the  ending  point  of  e2-7  to  x q  .  Edge-controller  Jikj  is  a  time- varying 
LQG  controller  that  is  designed  to  track  this  finite  trajectory.  The  main  role  of  the  edge-controller  is  that  it  takes  the  belief 
at  node  B{  and  drives  it  to  the  vicinity  of  a  starting  point  of  orbit  0-7  ,  where  it  hands  over  the  system  to  the  node-controller, 
and  node-controller  in  turn  takes  the  system  to  a  FIRM  node. 

Local  controllers:  Thus,  overall,  the  local  controller  (or  graph  edge  in  belief  space)  pP ,2J  is  the  concatenation  of  the 
edge-controller  ~pk'lJ  and  the  node-controller  pk.  Note  that  since  reachability  is  guaranteed  by  the  node-controller  (PLQG), 
by  this  construction,  the  stopping  region  U  1B°1  is  also  reachable  under  the  local  controller  pP'%K 

C.  Transition  Probabilities  and  Costs 

In  general,  it  can  be  a  computationally  expensive  task  to  compute  the  transition  probabilities  F(-\Bla,  ppM)  and  costs 
C(Bla,  associated  with  invoking  local  controller  at  node  Bla.  However,  owing  to  the  offline  construction  of 

FIRM,  it  is  not  an  issue  in  FIRM.  We  utilize  sequential  Monte-Carlo  methods  [15]  to  compute  the  collision  and  absorption 
probabilities.  In  other  words,  for  each  graph  edge  we  simulate  the  execution  of  the  corresponding  local  controller  for  M 
times  and  accordingly  approximate  the  probability  of  reaching  the  nodes  on  the  target  orbit  as  well  as  probability  of  hitting 
the  failure  set  along  the  way.  This  process  is  done  offline. 

Depending  on  the  application,  a  suitable  transition  cost  can  be  defined.  In  this  paper,  we  consider  a  measure  of  estimation 
accuracy  as  the  transition  cost  along  the  edges.  This  leads  to  a  planner  that  favors  paths,  on  which  the  estimator  and 
consequently  the  controller  can  perform  better.  A  measure  of  estimation  error  we  use  here  is  the  trace  of  estimation  covariance; 
i.e.,  =  E[^=1  tr(Pk^)],  where  is  the  estimation  covariance  at  the  k- th  time  step  of  the  execution  of  local 

controller  pP'lK  The  outer  expectation  operator  is  useful  in  dealing  with  the  Extended  Kalman  Filter  (EKF),  whose  covariance 
is  stochastic  [14],  [25].  Moreover,  as  we  are  also  interested  in  faster  paths,  we  take  into  account  the  corresponding  mean 
stopping  time,  i.e.,  Ta =  E [Ta,zjf],  and  the  total  cost  of  invoking  pP at  Bla  is  considered  as  a  linear  combination  of 
the  estimation  accuracy  and  expected  stopping  time,  with  suitable  scalar  coefficients  £1  and  £2- 

C(Bi,pL^ij)  =  +  &Ta4j .  (18) 
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D.  Construction  of  PLQG-FIRM  and  Planning  With  it 

Offline  construction  of  FIRM:  The  crucial  feature  of  FIRM  is  that  it  can  be  constructed  offline  and  stored,  independent 
of  future  queries.  Note  that  based  on  Algorithms  1  and  2,  we  still  need  to  know  the  goal  location.  However,  to  be  fully 
independent  of  both  start  and  goal  location  of  the  query,  one  can  solve  the  DP  in  the  online  phase.  Owing  to  the  reduction 
from  the  original  POMDP  to  a  dynamic  programming  on  the  finite  number  of  graph  nodes,  we  can  solve  DP  in  (8)  using 
standard  DP  techniques  such  as  value/policy  iteration  to  get  the  optimal  graph  policy  i t9  .  Algorithm  1  details  the  offline 
construction  of  the  FIRM  graph. 

Online  planning  with  FIRM:  Since  the  FIRM  graph  is  computed  offline,  the  online  phase  of  planning  (and  replanning) 
on  the  roadmap  becomes  very  efficient.  If  the  given  initial  belief  bo  belongs  to  any  FIRM  node,  online  compuation  reduces 
to  evaluating  the  function  i r  in  Eq.  9.  Otherwise,  the  only  online  compuation  would  be  evaluation  of  the  first  local  controller 
fi 2  based  on  Eq.  10.  In  the  latter  case,  to  form  the  new  edges  required  in  10,  we  first  create  a  singleton  set  B0  =  {fro}. 
Then,  to  connect  Bo  to  FIRM,  we  compute  the  expected  value  of  the  robot  state,  i.e.  E[xo\  using  its  distribution  fro  and 
add  E[#o]  1°  the  underlying  PNPRM  nodes.  The  set  of  newly  added  edges  going  from  E[x0]  to  the  nodes  on  PNPRM  are 
denoted  as  £(0).  We  design  the  local  controllers  associated  with  each  edge  in  £(0)  and  call  the  set  of  them  as  M(0).  Then, 
we  choose  /i®  based  on  Eq.  10  and  follow  policy  i r  in  Eq.  9  afterwards.  Algorithm  2  illustrates  this  procedure. 

Computational  complexity  of  offline  graph  construction:  Consider  an  underlying  PNPRM  with  N  orbits,  m  nodes 
on  each  orbit,  and  degree  k\  i.e.,  each  orbit  in  PNPRM  is  connected  to  k  nearest  neighboring  orbits.  Thus,  overall  it  has 
mN  nodes  and  Nk  orbit  edges.  In  the  offline  phase  we  need  to  leverage  PNPRM  orbits  and  edges  to  FIRM  orbits  and 
edges  in  belief  space,  (i)  Extension  of  PNPRM  orbits  to  belief  space  consists  of  a  constant  computation  of  solving  two 
Riccati  equations  and  designing  corresponding  PLQG  controller.  Denoting  the  computational  complexity  of  this  process  by 
cn,  the  computational  complexity  of  extending  PNPRM  orbits  to  FIRM  orbits  is  of  the  order  0(cnN).  (ii)  Extension  of  each 
PNPRM  edge  to  belief  space  consists  of  evaluating  the  performance  of  its  corresponding  local  controller  and  computing 
transition  probabilities  and  costs.  Let  us  denote  the  cost  of  this  process  by  ce.  In  a  PNPRM  with  degree  k ,  we  have  Nk 
edges  and  corresponding  to  each  PNPRM  edge,  we  have  m  FIRM  edges.  Thus,  the  computational  complexity  of  extending 
edges  to  belief  space  is  0(cemNk).  So,  overall  the  offline  computational  complexity  is  0(cnN  +  cemNk).  The  complexity 
of  each  iteration  in  value  iteration  algorithm  is  0(|V|2|M|),  where  |V|  =  mN  nodes  and  |M|  =  mNk  .  However,  in  practice 
the  dominating  factor  is  the  extension  of  edges  to  belief  space  because  the  constant  multiplier  ce  in  general  is  large.  If 
the  Monte  Carlo  simulation  is  chosen  to  evaluate  the  edge  costs  and  transition  probabilities,  ce  will  increase  linearly  in  the 
number  of  particles  utilized  in  the  Monte  Carlo  simulation  as  well  as  the  number  of  constraints.  It  will  also  depend  on  how 
the  constraints  are  being  evaluated. 

Computational  complexity  of  online  planning  with  graph:  As  discussed  in  Section  IV,  the  only  part  that  needs  to  be 
done  online  is  the  computation  of  first  local  controller  (See  Eq.  10).  To  do  so,  we  need  to  evaluate  k  edges  only.  Thus, 
the  computational  complexity  of  online  planning  with  FIRM  is  0(cek).  This  computation  occurs  once  in  the  beginning  of 
planning.  The  rest  of  planning  is  just  plugging  last  visited  FIRM  node  B  and  current  belief  fr  into  the  planner  i r  (See  Eq. 
9)  and  generating  the  control  signal  u k. 

VI.  Experimental  Results 

In  this  section  we  present  simulation  results  for  two  different  types  of  robots:  a  planar  robot  whose  motion  is  described  by  a 
unicycle  model  and  a  6  DoF  small  aerial  vehicle  subject  to  rigid  body  kinematics.  The  robots  are  equipped  with  exteroceptive 
sensors  that  provide  range  and  bearing  measurements  from  existing  radio  beacons  (landmarks)  in  the  environment. 

A.  2D  Unicycle  Model 

Here,  we  illustrate  the  results  of  FIRM  construction  on  a  simple  PNPRM. 

Motion  model:  As  a  motion  model,  we  consider  the  nonholonomic  unicycle  model  which  has  the  following  kinematics: 

(x/c  +  (VkSt  +  nvVSt )  cos  0k  \ 

y k  +  (Nk5t  H-  Tiys/di)  sin  0k  J  ,  (19) 

6k  +  uk5t  +  ) 

where  xk  =  (xk,y k,Qk)T  describes  the  robot  state  (2D  position  and  heading  angle).  The  vector  uk  =  (14,  ujk)T  is  the  control 
vector  consisting  of  linear  velocity  Vk  and  angular  velocity  ujk.  The  motion  noise  vector  is  denoted  by  wk  =  ( nv,nUJ)T  ~ 

M(0,Qk)- 

Observation  model:  The  i-th  landmark  is  denoted  by  Ll  and  the  vector  from  robot  to  the  i- th  landmark  is  denoted  by 
ld  =  [zdxfdy]T  :=  Ll  —  p,  where  p  =  [x,  y]T  is  the  position  of  the  robot.  Measuring  Ll  is  modeled  as  follows: 

=  lh(x,  lv)  =  [||M||,  atan2(z<4,  ldx )  —  0]T  +  4,  (20) 

where,  atan2(-,  •)  is  the  four-quadrant  inverse  tangent  function.  Observation  noise  is  drawn  from  a  zero-mean  Gaussian 
distribution  lv  ~  A/’(0,2R)  where  *R  =  diag((?7r ||M||  +  cr£)2,  (?76> ||zd ||  +  cr^)2).  Function  “diag”  returns  a  square  block- 
diagonal  matrix  by  placing  its  inputs  on  the  main  diagonal.  The  uncertainty  (standard  deviation)  of  sensor  reading  increases 
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Algorithm  1:  Offline  Construction  of  PLQG-FIRM 

input  :  State  space  X,  constraints  set  F,  belief  space  B 
output  :  FIRM  graph  G 

Construct  a  PNPRM  with  T-periodic  orbits  O  =  {O-7  =  (xpk ,  ^)fe>o}»  nodes  V  =  {v^},  and  edges  £  =  {e2-7},  where 
i,j  =  1,  •  •  •  ,  n  and  a  =  1,  •  •  •  ,  m\ 
foreach  PNPRM  orbit  CP  e  O  do 

Design  the  node-controller  (periodic  LQG)  pk  along  the  periodic  trajectory; 

Compute  the  periodic  mean  belief  trajectory  bk  =  (xk\P^)  using  (15); 

Construct  m  FIRM  nodes  ¥J  =  {FfJ,  •  •  •  ,  B P}  using  (17),  where  B°a  is  centered  at  bk^\ 

Collect  all  FIRM  nodes  ¥  =  U™=1Yj; 
foreach  (Bla,elj)  pair  do 

Design  the  edge-controller  pk,lJ ,  as  discussed  in  Section  V-B; 

Construct  the  local  controller  pk'lJ  by  concatenating  edge-controller  ~fik,lJ  and  node-controller  p?k\ 

Set  the  initial  belief  bo  equal  to  the  center  of  Bla,  based  on  the  approximation  in  (4); 

Generate  (in  simulation)  sample  belief  paths  bo-.r  and  state  paths  xo-.r  induced  by  controller  ppP  invoked  at  Bla; 
Compute  the  transition  probabilities  F9(F\Bla,  paP)  and  ¥9(Bif\Bla,  paP)  for  all  7  and  transition  cost 
C9(Bla,  paP)  based  on  the  simulated  trajectories  (see  Section  V-C); 

Collect  all  local  controllers  M  =  I//*’2-7}; 

Compute  cost-to-go  J9  and  feedback  7 t9  over  the  FIRM  graph  by  solving  the  DP  in  (8); 

G=  (¥,M,  J9,tt9); 

return  G; 


as  the  robot  gets  farther  from  the  landmarks.  The  parameters  rjr  =  rjo  =  0.3  determine  this  dependency,  and  =0.01  meter 
and  adh  =0.5  degrees  are  the  bias  standard  deviations.  A  similar  model  for  range  sensing  is  used  in  [24].  The  robot  observes 
all  Nl  landmarks  at  all  times  and  their  observation  noises  are  independent.  Thus,  the  total  measurement  vector  is  denoted 
by  z  =  \izT ,  2 zT ,  •  •  •  ,  NLZpy'  anc[  c[ue  to  the  independence  of  measurements  of  different  landmarks,  the  observation  model 
for  all  landmarks  can  be  written  as  z  =  h(x)  +  v,  where  v  ~  Af{ 0,  R)  and  R  =  diag(1R,  •  •  •  ,  ^R). 

We  first  show  a  typical  SPPS  solution  of  DPRE  on  the  orbits.  Fig.  3(a)  shows  a  simple  environment  with  six  radio  beacons 
(black  stars).  For  illustration  purposes,  we  choose  five  large  circular  orbits  and  every  orbit  is  discretized  to  100  steps.  Thus 
the  SPPS  solution  of  the  DPRE  in  (14)  on  each  orbit  leads  to  hundred  covariance  matrices  that  are  superimposed  on  the  graph 
in  red.  As  is  seen  from  Fig.  3(a),  the  localization  uncertainty  along  the  orbit  is  not  homogeneous  and  varies  periodically. 
Another  important  observation  from  the  Fig.  3(a)  is  obtained  by  noticing  the  left  top  orbit  in  the  Fig.  3(a).  As  it  can  be 
seen,  the  localization  uncertainty  (covariance  ellipse)  in  the  left  and  right  hand  sides  of  the  landmark  are  not  symmetric 
(the  right  hand  side  is  larger  than  the  left  hand  side).  In  other  words,  two  points  on  an  orbit  with  the  same  distance  from 
landmarks  (i.e.,  with  the  same  observation  noise)  might  have  different  localization  uncertainty,  which  emphasizes  the  role 
of  the  dynamics  model  in  filtering  and  its  interaction  with  the  observation  model.  In  Fig.  3(b),  we  illustrate  the  covariance 


(a)  (b) 


Fig.  3.  (a)  Five  orbits  (T  =  100)  and  corresponding  periodic  estimation  covariances  as  the  SPPS  solution  of  DPRE  in  (14).  (b)  Sample  covariance 

convergence  on  an  orbit  (T  =  20)  under  PLQG.  Red  ellipses  are  the  solution  of  DPRE  and  green  ellipses  are  the  evolution  of  estimation  covariance.  The 
initial  covariance  is  three  times  bigger  than  the  SPPS  solution  of  DPRE,  i.e.,  Pq  =  3Pq. 
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Algorithm  2:  Online  Phase  Algorithm  (Planning  with  PLQG-based  FIRM) 


1  input  :  Initial  belief  bo,  FIRM  graph  G,  Underlying  PNPRM  graph 

2  if  G  V  such  that  bo  G  Bla  then 

3  |  Choose  the  next  local  controller  pF^3  =7 r9(Bla)’, 

4  else 

5  Compute  vo  =  E[#o]  based  on  bo,  and  connect  vo  to  the  PNPRM  orbits.  Call  the  set  of  newly  added  edges 
£(0)  =  {e<W}; 

6  Design  local  planners  associated  with  edges  in  £(0);  Collect  them  in  set  M(0)  =  {p0,°3}; 

7  foreach  p  G  M(0)  do 

8  Generate  (simulate)  sample  belief  and  state  paths  bo-.r ,  %o-.T  induced  by  taking  p  at  bo; 

9  Compute  transition  probabilities  P(-|&o,aO  and  transition  costs  C(bo,p); 
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Set  a,i  =  0;  Choose  the  best  initial  local  planner  p0,°3  within  the  set  M(0)  using  (10); 


n 

12 

13 

14 

15 

16 


while  Bla  ^  Bg0ai  do 

while  (JP^s.t.,  bk  G  B^)  and  “no  collision”  do 
Apply  the  control  Uk  =  p%,tJ  (bk)  to  the  system; 
Get  the  measurement  zk+ i  from  sensors; 
if  Collision  happens  then  return  Collision; 

_  Update  belief  as  bk+ i  =  r(bk,  p^l\bk),  zk+1)\ 


17  Update  the  current  FIRM  node  B\  =  B^\ 

18  Choose  the  next  local  controller  pa^3  =7 r9(Bla); 


convergence  in  the  periodic  belief  process.  As  can  be  seen  in  Fig.  3(b),  the  initial  covariance  is  three  times  larger  than 
the  limiting  covariance,  and  in  less  than  one  period  it  converges  to  the  SPPS  solution  of  DPRE.  The  convergence  time  is 
a  random  quantity,  whose  mean  and  variance  can  be  estimated  through  simulations.  However,  in  practical  cases  it  usually 
converges  in  less  than  one  full  period,  because  the  initial  covariance  is  closer  to  the  actual  solution  (due  to  the  use  of 
edge-controllers)  and  also  the  orbit  size  is  much  smaller,  when  compared  to  Fig.  3(b). 

Figure  4(a)  shows  a  sample  PNPRM  with  23  orbits  and  67  edges.  To  simplify  the  explanation  of  the  results,  we  assume 
m  =  1,  i.e.,  we  choose  one  node  on  each  orbit.  All  elements  in  Fig.  4(a)  are  defined  in  (x,  y,  0)  space  but  only  the  (x,  y) 
portion  is  shown  here.  To  construct  the  FIRM  nodes,  we  first  solve  the  corresponding  DPREs  on  each  orbit  and  design 
its  corresponding  node-controller  (PLQG).  Then,  we  pick  the  node  centers  =  (v^,P^  )  and  construct  the  FIRM  nodes 
based  on  the  component- wise  version  of  (17),  to  handle  the  error  scale  difference  in  position  and  orientation  variables: 

Bl  {b  (x,  P)  \x-vi\<e,\P-Pij<A},  (21) 

where  |-|  and  <  stand  for  the  absolute  value  and  component- wise  comparison  operators,  respectively.  We  set  e  =  [0.8, 0.8,  5°]T 
and  A  =  eeT  to  quantify  B3^  s. 

After  designing  FIRM  nodes  and  local  controllers,  the  transition  costs  and  probabilities  are  computed  in  the  offline 
construction  phase.  Here,  we  use  sequential  weighted  Monte-Carlo  based  algorithms  [15]  to  compute  these  quantities. 
In  other  words,  for  every  (Bla,pa’13)  pair,  we  perform  M  runs  and  accordingly  approximate  the  transition  probabilities 
¥9(Bif\Bla,  pa^3),  F9(F\Bla,  pa^3),  and  costs  C9(Bla,  pa^3).  A  similar  approach  is  detailed  in  [5].  Table  I  shows  these 
quantities  for  several  (P^,  pa,t3)  pairs  corresponding  to  Fig.  4(a),  where  M  =  101  and  the  coefficients  in  (18)  are  =  0.98 
and  £2  =  0.02. 

TABLE  I 

Computed  costs  for  several  pairs  of  node-and-controller  using  101  particles. 


(. B pair 

2>3) 

4’5) 

B*y>V’20) 

P9(F|B*  ,naM) 

9.9010% 

17.8218% 

15.8416% 

29.7030% 

7.9208% 

1.9802% 

0.9901% 

QOl, 13 

2.1386 

2.2834 

1.9181 

0.9152 

2.1695 

1.1857 

0.4385 

E 

63.6703 

82.6747 

62.5882 

58.2000 

51.7033 

50.2755 

35.4653 

Plugging  the  computed  transition  costs  and  probabilities  into  (8),  we  can  solve  the  DP  problem  and  compute  the  policy  7 t9 
on  the  graph.  This  process  is  performed  only  once  offline,  independent  of  the  starting  point  of  the  query.  Fig.  4(b)  shows  the 
policy  7 t9  on  the  constructed  FIRM  in  this  example.  At  every  FIRM  node  Bla ,  the  policy  7 t9  decides  which  local  controller 
needs  to  be  invoked,  which  in  turn  aims  to  take  the  robot  belief  to  the  next  FIRM  node.  It  is  worth  noting  that  if  we  had 
more  than  one  node  on  each  orbit,  the  feedback  7 t9  may  return  different  controllers  for  each  of  them  and  for  every  orbit  we 
may  have  more  than  one  outgoing  arrow  in  Fig.  4(b). 
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(a)  (b) 

Fig.  4.  A  sample  PNPRM  with  circular  orbits.  Number  of  each  orbit  is  written  at  its  center.  Nine  landmarks  (black  stars)  and  obstacles  (gray  polygons) 
are  also  shown.  The  directions  of  motion  on  orbits  and  edges  are  shown  by  little  triangles  with  a  cross  in  their  heading  direction,  (a)  Orbits  2  and  7 
(distinguished  in  black)  are  the  start  and  goal  orbits,  respectively.  Shortest  path  (green)  and  the  most-likely  path  (red)  under  the  FIRM  policy  are  also 
shown,  (b)  Assuming  on  each  orbit,  there  exists  a  single  node,  the  feedback  n9  is  visualized  for  all  FIRM  nodes. 


As  discussed,  the  online  part  of  planning  is  very  efficient  as  it  only  requires  executing  the  controller  and  generating  the 
control  signal.  Moreover,  if  due  to  some  unmodeled  large  disturbances,  the  system  deviates  significantly  from  the  planned 
path,  it  suffices  to  bring  the  system  back  to  the  closest  FIRM  node  and  from  thereon  the  optimal  plan  is  already  known, 
i.e.,  7 t9  drives  the  robot  to  the  goal  region  as  shown  in  Fig.  4(b). 

We  show  the  most  likely  path  under  the  i t9  in  red  in  Fig.  4(a).  The  shortest  path  is  also  illustrated  in  Fig.  4(a)  in  green. 
It  can  be  seen  that  the  “most  likely  path  under  the  best  policy”  detours  from  the  shortest  path  to  a  path  along  which  the 
filtering  uncertainty  is  smaller,  and  it  is  easier  for  the  controller  to  avoid  collisions. 

B.  6  DoF  Aircraft  Model 

In  this  section,  we  consider  a  surveillance  application  for  a  small  fixed  wing  aerial  vehicle.  Methods  such  as  [7]  have 
investigated  stochastic  optimal  control  of  small  aerial  vehicles  under  stochastic  wind.  In  this  section,  we  extend  such  methods 
to  belief  space  where  the  perfect  state  of  vehicle  is  not  available.  We  assume  that  targets  to  monitor  are  submitted  from  the 
control  station  frequently.  Each  time  a  new  target  is  submitted,  the  aircraft  has  to  replan  in  real-time  and  go  toward  the  new 
goal,  while  minimizing  the  collision  probability  and  the  costs  associated  with  the  task  objective. 

System  state:  The  system  considered  in  this  experiment  is  a  robot  with  6  Degrees  of  Freedom  (DoF).  The  motion  is  the 
rigid  body  6  DoF  kinematics.  The  state  of  the  robot  Xk  at  time  k  is  composed  of  its  3D  position  in  Cartesian  coordinates 
Pk  described  in  the  ground  (inertial)  frame  and  its  orientation  q&,  which  is  encoded  by  quaternions. 

Xk  =  [pf,qf]T  =  [xk,yk,Zk,gofe,gifc,?2fc,43fe]T,  (22) 

where 

Pfc  =  [*k,yk,Zk]T,qfc  =  [vofc ,  Qifc ,  V2fc ,  V3fc]r,  (23) 


Ground  Frame  (inertial) 


Fig.  5.  The  aircraft  attached  body-fixed  frame  and  ground  frame 
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Remark:  Let  us  denote  the  control  error  (the  difference  between  the  desired  state  xp  and  the  mean  of  estimated  state  xk) 
by  e^.  In  computing  the  control  error,  one  can  directly  subtract  the  positional  part  of  the  state  vector.  However,  the  error  in 
orientation  (quaternions)  q  and  q'  is  calculated  as  £q  =  q0inv(q')  where  0  and  inv(-)  denote  the  quaternion  multiplication 
and  inversion  operators,  respectively.  We  set  Sqo  =  0  in  calculating  the  control  error.  This  is  valid  for  small  rotations  since 
a  change  in  the  scalar  part  of  the  quaternion  does  not  provide  information  about  the  direction  of  the  rotation  vector.  Further, 
since  we  know  that  for  a  quaternion  q,  g§  +  qf  +  g2  +  Q.I  =  1>  by  controlling  gi,  g2,  #3  we  implicitly  control  go. 

Motion  Model:  Let  /  be  the  state  transition  function  such  that, 

Xk+i  =  f(xk,uk,wk ),  (24) 

where,  the  control  vector  is  composed  of  the  vehicle’s  linear  velocity  Vk  along  body  x-axis  and  angular  velocities  about 
the  body  axes. 

uk  =  [Vk,u)k,u)l,u>l\,  (25) 

in  which,  uor ,  ccp,  and  ujy  are  the  roll,  pitch,  and  yaw  rates,  respectively.  The  motion  noise  is  denoted  by  wk  =  (nv ,  n^r ,  n^v ,  n^y  )T  ~ 
J\T( 0,  Qfc).  In  our  simulations,  Q  =  dmg((r]VV  +  cr^)2,  {rqujrujr  +  cr^")2,  {r^pUJ19  +  cr^P)2),  ( rjuvU)y  +  cr^)2),  where  the 
parameters  are  rjy  =  rj^r  =  r]uP  =  r\^y  =  0.005,  cr^  =  0.02  meters,  afi  =  a%P  =  a^v  =  0.25  degrees.  To  describe  the 
kinematics  model,  we  split  the  motion  model  into  two  parts:  position  and  orientation  (attitude). 

To  derive  a  model  that  governs  the  position  of  the  robot  (i.e.,  pk+%  =  /p(Pfc,  ^k,uk,wk))9  we  first  need  to  transform 
velocity  Vk  from  body  to  the  ground  frame.  We  denote  the  velocity  in  the  body-fixed  frame  as  b  V  and  in  the  inertial  (ground) 
frame  as  9V.  Thus, 

sy  =  Rgbbv,  (26) 

where,  bV  =  [V,  0,  0]T  and  Rgb  is  the  rotation  matrix  that  transforms  the  body  frame  to  the  ground  frame.  In  terms  of  the 
quaternions,  the  Rg &  matrix  is  as  follows: 

ql  +  qI  ~  <?2  -  ql  z(qi<i2  -  qm )  ^(qm  +  qoq? ) 

Rgb( q)  =  %i<?2  +  qoq3)  Qo  -  q\  +  ql  -  qt  2(g2<?3  -  •  (27) 

2(qiq3-q0q2)  2(q2q3  +  q0qi)  ql  ~  ql  ~  ql  +  q3  _ 

Similarly,  we  transform  the  motion  noise  in  velocity  to  the  ground  frame, 

9ny  =  Rgbbny ,  (28) 

where  bny  =  [ny,  0,0]T.  Therefore,  fp  can  be  described  as: 

Pfc+i  =  /p(Pfc,qfe,  uk,wk)  =  p  k  +  9V5t  +  9nvVfrt  =  pk  +  Rgb(q)(bV5t  +  bnvV5t).  (29) 

Now,  we  discuss  the  model  we  utilize  to  govern  the  orientation  of  the  robot  (i.e.,  q/c+i  =  fq(*lk,  uk,  wk)).  We  start  by  the 
quaternion-based  attitude  kinematics  in  its  continuous-time  form  that  can  be  written  as  q  =  Auo,  where  cc  =  [ccr,  uoy]T 
is  the  angular  velocity  vector  of  the  robot  with  respect  to  the  inertial  frame  expressed  in  the  body  frame,  and  A  is  given  by: 

-Qi  -Q2  -tf3 

go  -Q3  Q2 

<73  go  -gi 

-Q2  gi  go 

Therefore,  the  discrete  version  of  the  quaternion  evolution  (before  sign  check)  is  as  follows: 

Qfc+1  =  /q(qfc>ufc,wfe)  =  qfe  +q  5t  +  nq,  (31) 

where, 

nq  =  A^n^r ,  n^p ,  nuy  )TV&t.  (32) 

However,  to  avoid  discontinuity  in  the  control  error  e^,  we  keep  the  scalar  part  of  quaternion  positive;  i.e.  the  quaternion 
at  the  (k  +  l)-th  time  step  is: 

qfc+i  =  /(qfc+i)  =  qfc+isign(qofc+1)>  (33) 

where  sign(q§fc+1)  is  1  if  q§fc+1  >  0,  and  is  —1  otherwise.  This  procedure  leads  to  the  smaller  angle  since  go  =  cos(0/2) 
where  </>  is  the  magnitude  of  rotation,  and  thus,  the  smaller  angular  difference  (i.e.,  \(j)\  <  7r)  always  leads  to  a  positive  go. 

Note  that  we  are  allowed  to  do  this  because  quaternions  are  invariant  to  sign;  i.e.,  q/c+i  and  —  q/c+i  represent  the  same 
orientation.  Thus  overall  we  get  qfe+i  =  fq(qk,uk,wk)  =  f(fq(qk,uk,wk)). 
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Finally,  since  the  quaternions  norm  is  constrained  (i.e.,  ||q||  =  1),  if  the  result  of  an  approximate  calculation  such  as 
linearized  Kalman  filter  is  a  quaternion  q  that  does  not  satisfy  this  constraint,  we  apply  the  transformation  q'  =  ||q||_1q  = 
g( q).  Note  that  function  g  is  applied  on  the  mean  value  and  its  first  order  approximation  is  applied  on  the  covariance  of  the 
quaternion  estimation. 

Observation  Model:  The  3D  location  of  the  i-th  Landmark  is  defined  as  L%  =  [Llx,  LK,Llz\.  We  denote  the  relative  vector 
from  robot  to  landmark  Ll  by  ld9  =  [ld9c,ld9,ld9z\T  :=  Ll  —  p,  where  p  =  [x,  y,  z]T  is  the  position  of  the  robot  in  the 
ground  frame.  The  relative  vector  ld9  needs  to  be  rotated  from  the  ground  frame  to  the  body  frame  by  the  rotation  matrix 
Rbg  =  RTgb.  Thus,  =  iV d®. 

The  measurement  Ll  can  be  modeled  as  follows: 

lz  =  h(x,lv)  =  [V,  la,  l/3]T  =  [||M5||,atan2(2d^/d^),atan2(M^/d^)]T  +  lv,  lv~N( 0,R3)  (34) 

where  R3  =  diag((?7r  ||2d||  +  cr£)2,  (77^ ||zd||  +  )2,  (77/3 ||zd||  +  erf)2)-  The  parameters  are  r\r  —  0.01,  =  rjp  =  0.3  and 

arh  =0.01  meter  and  cr%  =  erf  =  0.5  degrees  are  the  bias  standard  deviations. 

PNPRM  generation:  To  generate  the  underlying  PNPRM,  we  need  to  sample  orbits  and  connect  them  to  each  other. 
In  this  experiment,  we  consider  circular  (counter-clockwise)  orbits  that  are  parallel  to  the  ground.  To  sample  an  orbit,  we 
sample  a  random  point  pc  in  3D  space  as  the  orbit  center,  and  generate  a  circular  trajectory  with  a  given  maximum  yaw 
rate  centered  at  pc.  More  details  on  this  construction  can  be  found  in  [1].  Finally,  we  choose  three  nodes  on  each  orbit 
uniformly  distributed  along  the  orbit. 

The  edge  connecting  node  vla  to  orbit  O 9  is  composed  of  two  segments:  pre-edge  etcx 9  and  orbit-edge  elK  The  edge  elJ , 
connects  the  leaving  point  on  orbit  Ol  to  the  entry  point  on  orbit  OK  To  construct  elK  we  use  the  RRT  (Rapidly  exploring 
Random  Tree)  approach  [22].  However,  we  inject  user  information  and  guide  the  sampling  procedure  in  RRT  to  obtain 
better  and  faster  results.  The  details  of  this  implementation  can  be  found  in  [1].  It  is  worth  noting  that  in  our  PNPRM 
construction  for  both  2D  and  3D  systems,  we  assume  that  orbits  are  counter-clockwise  in  direction.  An  alternate  approach 
with  both  clockwise  and  counter-clockwise  orbits  could  also  be  adopted  since  our  method  is  not  restrictive  in  that  sense.  In 
this  simulations,  we  limit  ourselves  to  a  single  orbit  direction  for  reasons  of  simplicity  and  clarity. 

Planning  for  6D  aircraft  with  FIRM:  After  generating  a  PNPRM,  we  leverage  the  orbits  and  edges  to  belief  space  as 
discussed  in  Section  V.  Accordingly,  we  compute  the  edge  costs  and  solve  the  DP  on  the  FIRM  graph  to  get  a  feedback 
from  graph  nodes  to  graph  edges.  Fig.  6  depicts  a  3-D  environment  with  the  constructed  PNPRM.  The  robot  is  given  a  task 
to  visit  nodes  2,  3  and  7  in  that  order  starting  from  node  1.  These  nodes  represent  locations  where  the  robot  is  to  perform 
intelligence  gathering.  Fig.  7  shows  the  feedback  n9  on  the  FIRM  graph;  i.e.,  it  shows  the  best  edge  that  1 t9  selects  at  each 
node.  Shortest  path  is  shown  in  green  whereas  the  most  likely  path  under  the  policy  is  depicted  in  red.  It  can  be  seen  that 
the  path  selected  through  FIRM  takes  routes  which  are  more  informative  and  thus  have  less  filtering  uncertainty.  It  is  worth 
noting  that  the  green  edges  are  not  a  part  of  feedback;  they  are  just  drawn  to  illustrate  the  shortest  path.  Fig.  8  shows  the 
feedback  to  go  to  node  3,  resulting  from  online  replanning  after  the  query  to  node  3  is  submitted.  Finally,  Fig.  9  shows  the 
feedback  to  node  7  after  the  next  online  replanning.  To  perform  replanning  (recomputing  the  feedback),  we  do  not  need  to 
re-construct  the  graph  or  recompute  the  edge  cost.  Multiple  queries  can  be  executed  by  simply  re-solving  the  DP  on  the 
FIRM  graph  with  a  new  goal. 


Fig.  6.  The  PNPRM  in  3D  showing  the  orbits  and  edges. 
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(a)  (b) 

Fig.  7.  Feedback  tt9  is  shown  with  orbit  2  as  the  goal  orbit,  (a)  Starting  from  orbit  1,  the  shortest  path  (green)  and  the  most-likely  path  (red)  are  shown 
from  the  top  view  (b)  The  shortest  path  (green)  and  the  most-likely  path  (red)  are  shown  in  the  3D  environment. 


100 


0  20  40  60  80  100 


(a)  (b) 

Fig.  8.  Feedback  tt9  is  shown  with  orbit  3  as  the  goal  orbit.  Starting  from  orbit  2,  the  most-likely  path  (red)  is  shown  (a)  from  the  top  view  and  (b)  in 
the  3D  environment. 


VII.  Conclusion 

This  paper  proposes  a  solution  to  the  problem  of  stochastic  planning  for  non-stoppable  (and  possibly  nonholonomic) 
systems,  such  as  small  fixed-wing  aerial  vehicles.  The  Periodic-Node  PRM  (PNPRM)  is  introduced  as  a  graph  in  the  state 
space,  whose  nodes  lie  on  periodic  trajectories,  called  orbits.  Exploiting  the  properties  of  periodic  LQG  controllers  on  the 
orbits,  we  designed  appropriate  local  controllers  to  accomplish  the  task  of  belief  reachability  for  non-stoppable  systems. 
Accordingly,  by  suitably  choosing  belief  nodes  along  the  orbits  we  constructed  a  graph  in  belief  space.  Planning  constraints 
can  be  seamlessly  embedded  along  the  edges  of  this  graph.  Finally,  the  framework  characterizes  the  success  probability  of 
reaching  the  goal  point  from  any  given  graph  node.  With  estimation  uncertainty  chosen  as  the  planning  cost,  simulation 
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(a)  (b) 


Fig.  9.  Feedback  n9  is  shown  with  orbit  7  as  the  goal  orbit.  Starting  from  orbit  3,  the  most-likely  path  (red)  is  shown  (a)  from  the  top  view  and  (b)  in 
the  3D  environment. 


results  for  two  different  types  of  robots  were  presented.  It  was  demonstrated  that  the  proposed  graph-based  scheme  for 
planning  under  uncertainty  tends  to  find  feedback  laws  that  guide  the  robot  toward  goal  through  information-rich  regions 
(leading  to  less  estimation  uncertainty)  and  regions  with  less  collision  probability. 
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Appendix  I 

Periodic  LQG  Controller 

Periodic  Linear  Quadratic  Gaussian  (PLQG)  controller  is  a  time- varying  LQG  controller  that  is  designed  to  track  a  periodic 
nominal  trajectory  in  the  presence  of  process  and  observation  noise. 

In  this  section,  we  first  discuss  the  system  linearization  and  nominal  trajectory,  and  then  discuss  the  KF,  LQR  and  LQG 
designed  along  this  trajectory.  Consider  the  nonlinear  partially-observable  state-space  equations  of  the  system  as  follows: 


Xk+1  =  f(xk,uk,wk),  wk~Af(0,Qk)  (35a) 

Zk  =  h(xk,vk),  vk  ~  A/'(0,  Rk).  (35b) 

A  T-periodic  nominal  trajectory  for  the  robot  is  a  sequence  of  planned  states  {xk)k>o  and  planned  controls  (uk)k> o,  such 
that  it  is  consistent  with  the  noiseless  dynamics  model,  i.e.,  we  have: 


vk+ 1 


=  /(«,  0), 


TP  —  TP 
Xk+T  ~ 


p  p 

Uk+T  ~  Uk' 


(36) 


The  role  of  a  closed-loop  stochastic  controller  in  tracking  a  nominal  trajectory  is  to  compensate  for  robot’s  deviations  (due 
to  the  noise)  from  the  nominal  trajectory  and  to  keep  the  robot  close  to  the  nominal  trajectory  in  the  sense  of  minimizing 
the  following  quadratic  cost: 


J  =  E 


yyajfe  -  xpk)TWx(xk  -  xpk)  +  (• Uk  -  upk)TWu{uk  -  Upk) 
k>  o 


(37) 


where  Wx  and  Wu  are  positive  definite  weight  matrices  for  the  state  and  control  cost,  respectively. 

Since  the  system’s  state  is  only  partially  observable,  at  every  step  of  LQG  execution,  a  Kalman  filter  estimates  the  system’s 
state  and  an  LQR  controller  generates  the  optimal  control  based  on  this  estimation.  We  first  linearize  the  system  along  the 
nominal  trajectory  and  then  describe  the  KF  and  LQR  designed  along  this  path. 

Model  linearization:  Given  a  periodic  nominal  trajectory  (xk,uk)k> o>  we  linearize  the  dynamics  and  observation  model 
in  (35),  as  follows: 


Xk+ 1  =  f(xpk,upk,  0)  +  Ak{xk  -  Xpk)  +  Bk(uk  -  Upk)  +  Gkwk ,  wk  ~  0,  Qk) 

zk  =  h(xpk,  0)  +  Hk(xk  -  xk)  +  Mkvk,  vk  ~  Af( 0,  Rk) 


where 


—  dl(TP 

dx  \Xk 


lPk,U- 

—  dh  (  P 
~  dx\Xki 


10) 


—  dlf^P  n,P 


0), 


^  =  1(4.0) 


,0), 


Gk=§i(xpk/I 


10), 


It  is  worth  noting  that  the  linearized  system  is  a  periodic  one,  i.e., 


(38a) 

(38b) 


(39) 


^4/c+t  =  Bk+r  =  Bk,  Gk+T  =  Gk,  Hk+r  =  Hk,  Mk+r  =  Qk+r  =  Qk >  Rk+r  =  Rk •  (40) 


Error  system:  Now,  let  us  define  the  following  errors: 

•  LQG  error  (main  error):  ek  =  xk  —  xk 

•  KF  error  (estimation  error):  ek  =  xk  —  xk 

•  LQR  error  (mean  of  estimation  of  LQG  error):  e k  =  xk  —  xpk 

Note  that  these  errors  are  linearly  dependent:  ek  ='ek  +?&.  Also,  defining  Suk  =  uk—uk  and  Szk  =  zk  —  zk  :=.  zk  —  h(xpk ,  0), 
we  can  rewrite  above  linearized  models  as  follows: 


ek+i  =  Akek  +  Bk5uk  +  Gkwk,  wk  ~  A/'(0,  Qk)  (41a) 

Szk  =  Hkek  +  Mkvk ,  vk  ~  AT(0,  Rk)  (41b) 


which  is  a  periodic  linear  system  due  to  (40). 

Periodic  Kalman  filter:  Periodic  Kalman  Filter  (PKF)  is  a  time-varying  Kalman  filter,  whose  underlying  linear  system 
is  periodic.  In  Kalman  filtering,  we  aim  to  provide  an  estimate  of  the  system’s  state  based  on  the  observations  we  have 
obtained  and  the  control  signals  we  have  applied  up  to  time  fc,  i.e.,  zo:k  and  uo,k-i-  The  estimated  state  is  a  random  vector 
denoted  by  xk,  whose  distribution  is  the  conditional  distribution  of  the  state  on  the  obtained  data  so  far,  which  is  referred 
to  as  belief  and  is  denoted  by  bk\ 


h  =P{xl)  =  p(xk\zo-.k,U0-.k-l) 

(42) 

Xk  =  E[Xfc|z0:fc,  ttO:fc— l] 

(43) 

Pk  I^Ch/c)  'UOik  —  l] 

(44) 
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where  E[- 1 *]  and  C[* | *]  are  the  conditional  expectation  and  conditional  covariance  operators,  respectively.  In  the  Gaussian 
case,  we  have  bk  =  Pk ),  i.e.,  the  belief  can  only  be  characterized  by  its  mean  and  covariance.  Hence,  we  can  show 

bk  as  the  mean-covariance  pair  bk  =  (x^,Pk).  Similar  to  the  conventional  Kalman  filtering,  PKF  consists  of  two  steps  at 
every  time  stage:  the  prediction  step  and  the  update  step.  In  the  prediction  step,  the  mean  and  covariance  of  prior  x is 
computed.  For  the  system  in  (41)  the  prediction  step  is: 


&k-\-l  —  3~~  Bkbuk 

Pk+ 1  =  +Pk^k  +  GkQkGl 

In  the  update  step,  the  mean  and  covariance  of  posterior  xk  is  computed.  For  the  system  in  (41),  the  update  step  is: 

Kk  =  P~Hl  (. HkP~Hl  +  MkRkMl)-1 
®fc+ 1  =  ek+ 1  P  Kk+l(fizk+l  —  Hk  •  1  *  A'  .  1  ) 

Pk+ 1  =  (I  ~  Kk+iHk+i)Pk+1 

Note  that 

xk  =  E[xk\z0:k,u0:k-i]  =  xvk  +E[ek\z0:k,u0:k-i]  =  xpk+  + 

Pk  =  ^['Ek\Z0:kT  V’O'.k—ll  =  P'l&k  ^0:fc— l]  ~  Pk 


(45) 

(46) 

(47) 

(48) 

(49) 

(50) 

(51) 


Lemma  4:  (Covariance  convergence  under  PLQG):  In  Periodic  Kalman  filtering,  if  for  all  k,  the  pair  ( Ak,Hk )  is 
detectable  and  the  pair  ( Ak,Qk )  is  stabilizable,  where  GkQkGk  =  QkQk,  then  the  prior  covariance  Pk,  the  posterior 
covariance  Pk,  and  the  filter  gain  Kk  all  converge  to  their  T-periodic  stationary  values,  denoted  by  Py ,  Pt,  and  Kt, 
respectively  [11].  Matrix  I)  is  the  unique  Symmetric  T-Periodic  Positive  Semi-definite  (SPPS)  solution  [1 1]  of  the  following 
Discrete  Periodic  Riccati  Equation  (DPRE): 


Pk+ 1  =  GkQkGTk  +  Ak(Pk  -  PkHl  (HkPkHk  +  MkRkMk)^  HkPk  )Ak 

Having  Pj~ ,  the  periodic  gain  Kk  and  estimation  covariance  Pk  are  computed  as  follows: 


Kk  =  PkHl \HkP~Hl 
Pk  =  {I-  KkHk)Pk 


where 


MkRkMk)~ 


p-  _  p- 

rk+T  —  rk 


K 


k+T 


=  Kk 


k+T 


=  Pk 


(52) 

(53) 

(54) 

(55) 


Proof:  See  [11].  ■ 

Note  that  if  the  pair  ( Ak^B-k )  is  detectable  and  the  pair  ( Ak,Qk )  is  stabilizable,  then  the  pair  (Ak,Hk)  is  observable  and 
the  pair  ( Ak,Qk )  is  controllable,  and  hence  Lemma  2  follows. 

Periodic  LQR  controller:  An  LQR  controller  is  utilized  as  the  separated  controller  [21]  within  the  structure  of  the  LQG 
controller.  Once  Kalman  filter  produces  the  estimation  (belief),  the  LQR  controller  generates  the  optimal  control  signal 
accordingly.  In  other  words,  we  have  a  time- varying  mapping  pk  from  belief  space  into  the  control  space  that  generates  an 
optimal  control  based  on  the  given  belief  Uk  =  Pk{^k)  at  every  time  step  k.  In  LQG,  the  mapping  pk  is  the  control  law  of 
the  LQR  controller,  which  is  optimal  in  the  sense  of  minimizing  the  following  cost: 


JpLQR  =  E 


=  E 


Ps+k  -  Xk)TWx+k  -  xk)  +  K  -  UPk)TWu(uk  -  UPk) 


k>0 


+  (5uk)TWu(6uk) 


k>0 


The  linear  control  law  that  minimizes  this  cost  function  for  a  linear  system  is: 

Suk  =  —Lk+l. ,  Lk+T  =  Lk 


(56) 


(57) 


Lemma  5:  In  Periodic  LQR  (PLQR),  if  for  all  k ,  the  pair  ( Ak,Bk )  is  stabilizable  and  the  pair  ( Ak,Wx )  is  detectable, 
where  Wx  =  WjWx,  then  the  time-varying  feedback  gains  Lk  are  T-periodic  gains,  i.e.,  Lk+r  =  Lk  and  are  computed  as 
follows: 


Lk  =  ( Bk  Sk+iBk  +  Wu)  Bk  Sk+iAk, 


(58) 
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where  Sk  is  the  SPPS  solution  of  the  following  DPRE: 

Sk  =  Wx  +  AlSk+1Ak  -  AlSk+lBk{BlSk+1Bk  +  WuflB£Sk+1Ak.  (59) 

Note  that  the  whole  control  is  uk  =  upk  +  Suk . 

Periodic  LQG  controller:  Plugging  the  obtained  control  law  of  PLQR  into  the  PKF  equations,  we  can  get  the  following 
error  dynamics: 


ek+ 1 

ek+ 1 
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or  equivalently, 
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Defining  :=  (e^e^  )T  and  qk  :=  (wfc,  u/c+i)T,  we  can  rewrite  (61)  in  a  more  compact  form  as 


Ofe+i  =  BkCk  -  Gkqk,  qk  ~  AT(0,Qk),  Qk  = 


Qk 

o 


o 

Rk+ 1 


with  appropriate  definitions  for  Fk  and  G&.  Thus,  (k  is  a  random  variable  with  a  Gaussian  distribution,  i.e., 

Ck  ~  AT(0,Vk), 

or 


W( 


where  'P;c  is  the  solution  of  the  following  Discrete  Periodic  Lyapunov  Equation  (DPLE): 


Pfe+1  =  FkVkFTk  -  GfeQfcC?r, 


(61) 


(62) 
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(64) 


(65) 


which  can  be  decomposed  into  four  blocks 


Vk  = 


(66) 


Vk,  11  Bk,12 

Bk,21  Vk,22 

Lemma  6:  Under  the  preceding  assumptions  in  Lemmas  4  and  5,  the  solution  of  DPLE  in  (65)  converges  to  a  unique 
SPPS  solution  Vk  independent  of  the  initial  covariance  Vo,  i.e.,  Vk+T  =  Vk. 

Proof:  See  [11].  ■ 

Therefore,  the  process  in  (62)  converges  to  a  cyclostationary  process  [10],  i.e.,  the  distribution  over  (k  is  periodic. 
Thus,  since  x k  ~  N(xpk,Vk,2 2),  the  distribution  over  the  estimation  mean  is  also  converges  to  a  periodic  distribution,  i.e., 
xk  rsj  N(xpk,Vk, 22)  =  N{Xk+TiBk+T, 22)-  Hence,  this  analysis  leads  to  the  following  lemma: 

Lemma  7:  Under  Periodic  LQG,  belief  falls  into  a  Gaussian  cyclostationary  process,  i.e.,  the  distribution  over  belief 
bk  =  (xk  ,  Pk )  converges  to  the  following  periodic  Gaussian  distribution: 


bk  =  (Ak,Pk)  ~  V 


4 


Pfc,22  0 

0  0 


(67) 


The  degeneracy  of  the  Gaussian  distribution  over  belief  in  (67)  is  due  to  the  fact  that  Pk  is  a  deterministic  process.  It 
is  worth  noting  that  the  belief  mean  converges  to  the  T-periodic  belief  E [bk+r]  =  M[bk\  =  (xp,Pk).  Hence,  the  Lemma  1 
follows,  as  it  is  the  same  as  Lemma  7,  where  we  have: 
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Pk,  22 
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(68) 
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Appendix  II 
proof  of  Lemma  3 

Proof:  Let  us  consider  the  state  space  model  of  a  T-periodic  linear  system  of  interest  as  follows: 

xk+ 1=  AkXk  +  B kuk  +  GkWk,  wk  ~  AT(0,  Q k)  (69a) 

zk=  H kxk  +  Vk,  vk  ~  J\T( 0,  It*).  (69b) 

Based  on  Lemma  1  and  Lemma  2,  if  (A,  B)  and  (A,  Q)  are  controllable  pairs,  where  GQGT  =  QQT,  and  if  (A,  H)  and 
(A,  Wj)  are  observable  pairs,  where  Wx  =  WjW^,  then  the  estimation  covariance  deterministically  tends  to  a  T-periodic 
stationary  covariance  Pk.  Therefore,  for  any  e  >  0,  after  a  deterministic  finite  time,  Pk  enters  the  e-neighborhood  of  the 
periodic  stationary  covariance,  i.e.,  ||  Pk  —  Pk  ||  m  ^  e  for  all  k  large  enough,  where  ||  •  ||  stands  for  an  appropriate  matrix 
norm. 

The  estimation  mean  dynamics,  however,  is  stochastic  and  is  as  follows  for  the  system  in  Eq.  (69): 

xk+i  =  Xls-\- 1  (A-k  B,L,  Kfc+i-^fc+iAfc)(^  x'k) 

+  Kfc+iHfe+iAfc(xfc  —  xp)  +  Kfc+iHfc+iGfc^  +  K.k+ivk+i 
—  Xk+1  (A-k  B^L/j;)^^  T  (A k  B/jjL/j  Kfc-i-iH^iA^)^ 

+  K^iHfc+iAfeXfc  +  K/e+iH/e+iG/clC/c  +  K/e+iU/c+1,  (70) 

where  the  Kalman  gain  K&  is: 

Kfe  =  Pfc-H^(HfcPfe-H^  +  Rfe)"1.  (71) 

Since  K&  is  full  rank  (due  to  the  condition  on  the  rank  of  H^)  for  all  k  and  since  v  and  w  are  Gaussian  noises,  (70)  induces 
an  irreducible  Markov  process  over  the  state  space  [23].  Thus,  if  we  have  a  stopping  region  for  the  estimation  mean  with 
size  e  >  0,  the  estimation  mean  process  will  hit  this  stopping  region  in  finite  time  [23],  with  probability  one,  i.e.,  for  a 
finite  v  G  X,  the  condition  \\xk  —  v||  <  e  is  satisfied  in  finite  time.  However,  v  can  be  chosen  in  a  way  that  maximizes  the 
absorption  probability  and  minimizes  the  hitting  time. 

Based  on  the  estimation  mean  dynamics  in  (70)  and  the  state  dynamics  in  Appendix  I,  if  the  estimation  mean  process 
and  state  process  start  from  x J  and  xo,  respectively,  such  that  E[Sq“]  =  xpk  and  E[xq]  =  xk  (which  indeed  is  the  case  in 
FIRM  due  to  the  usage  of  edge-controllers),  “the  mean  of  estimation  mean”  remains  on  xk,  i.e.,  E[x^]  =  xk,  for  all  k.  As 
a  result,  xpk  is  the  optimal  choice  for  the  center  of  stopping  region  and  thus,  the  condition  \\xk  —  xk\\  <  e  is  satisfied  in 
minimum  time  in  the  sense  of  “expected  value”. 

Combining  the  results  for  estimation  covariance  and  estimation  mean,  if  we  define  the  region  Bk  as  a  set  in  the  Gaussian 
belief  space  with  a  non-empty  interior  centered  at  (xk  , Pka),  then  the  belief  bk  =  (xk,Pk)  enters  region  U kBk  with 
minimum  finite  expected  time  with  probability  one.  To  decrease  the  number  of  nodes,  one  can  only  look  at  the  subsequence 
ba  •=  bka  =  (xk  ,  Pka)  and  Ba  :=  Bka  for  {fci,  &2,  •  *  *  ,  km}  C  {1,  2,  •  •  •  ,  T},  then  similarly  the  belief  ba  enters  region 
U aBa  in  finite  time  with  probability  one.  ■ 


20 


Robust  Online  Belief  Space  Planning  in  Changing  Environments: 
Application  to  Physical  Mobile  Robots 

Ali-akbar  Agha-mohammadi,  Saurav  Agarwal,  Aditya  Mahadevan, 

Suman  Chakravorty,  Daniel  Tomkins,  Jory  Denny,  Nancy  M.  Amato 


Abstract —  Motion  planning  in  belief  space  (under  motion  and 
sensing  uncertainty)  is  a  challenging  problem  due  to  the  compu¬ 
tational  intractability  of  its  exact  solution.  The  Feedback-based 
Information  RoadMap  (FIRM)  framework  made  an  important 
theoretical  step  toward  enabling  roadmap-based  planning  in 
belief  space  and  provided  a  computationally  tractable  version 
of  belief  space  planning.  However,  there  are  still  challenges  in 
applying  belief  space  planners  to  physical  systems,  such  as  the 
discrepancy  between  computational  models  and  real  physical 
models.  In  this  paper,  we  propose  a  dynamic  replanning  scheme 
in  belief  space  to  address  such  challenges.  Moreover,  we  present 
techniques  to  cope  with  changes  in  the  environment  (e.g., 
changes  in  the  obstacle  map),  as  well  as  unforeseen  large 
deviations  in  the  robot’s  location  (e.g.,  the  kidnapped  robot 
problem).  We  then  utilize  these  techniques  to  implement  the 
first  online  replanning  scheme  in  belief  space  on  a  physical 
mobile  robot  that  is  robust  to  changes  in  the  environment  and 
large  disturbances.  This  method  demonstrates  that  belief  space 
planning  is  a  practical  tool  for  robot  motion  planning. 

I.  Introduction 

Sequential  decision  making  under  uncertainty  is  a  key 
prerequisite  for  many  robotics  applications.  Consider  an 
autonomous,  low-cost  mobile  robot  that  is  subject  to  motion 
noise  and  lacks  exact  measurements  due  to  sensor  noise. 
Controlling  this  robot  and  planning  motions  for  it  is  an  in¬ 
stance  of  the  Partially-Observable  Markov  Decision  Process 
(POMDP)  [13],  [23]  problem,  which  is  a  formal  framework 
for  sequential  decision  making  under  uncertainty.  However, 
the  POMDP  problem  is  also  notorious  for  its  computational 
intractability.  Methods  such  as  [11],  [15],  [18],  [24],  [25] 
reduce  the  computation  burden  of  POMDPs  and  aim  to 
solve  more  challenging  and  realistic  problems.  Recently,  the 
Feedback-based  Information  RoadMap  (FIRM)  framework 
[3]  takes  an  important  theoretical  step  toward  realistic  scenar¬ 
ios  by  significantly  reducing  the  computational  complexity 
of  planning  under  uncertainty. 

Additionally,  handling  changes  in  the  environment  (e.g., 
obstacles),  changes  in  the  goal  location,  and  large  deviations 

Agha-mohammadi  is  with  the  Laboratory  for  Information  and  Decision 
Systems,  MIT,  Cambridge,  MA,  02139.  Agarwal  and  Chakravorty  are  with 
the  Dept,  of  Aerospace  Engineering,  and  Mahadevan,  Tomkins,  Denny, 
and  Amato  are  with  the  Dept,  of  Computer  Science  and  Engineering, 
Texas  A&M  University,  TX,  77843,  USA.  Emails:  aliagha@mit .  edu 
{sauravag, mahadeven, schakrav, kittsil, jorydenny 
amato}@tamu  .edu.  This  research  supported  in  part  by  AFOSR  Grant 
FA9550-08- 1-0038  and  by  NSF  awards  CNS-0551685,  CCF-0833199, 
CCF-0830753,  IIS-0916053,  IIS-0917266,  EFRI-1240483,  RI-1217991,  by 
NIH  NCI  R25  CA090301-11,  by  Chevron,  IBM,  Intel,  Oracle/Sun  and  by 
Award  KUS-C1-016-04,  made  by  King  Abdullah  University  of  Science 
and  Technology  (KAUST).  J.  Denny  supported  in  part  by  an  NSF  Graduate 
Research  Fellowship. 


(b) 

Fig.  1.  (a)  A  picture  of  robot  (iRobot  Create)  in  the  operating  environment. 
Landmarks  can  be  seen  on  the  walls,  (b)  Floorplan  of  the  environment,  in 
which  experiments  are  conducted. 

in  the  robot’s  location  calls  for  online  planning  in  uncertain, 
partially  observable  environments.  However,  when  dealing 
with  real-world  physical  systems,  POMDP-based  methods, 
including  FIRM,  encounter  another  important  challenge: 
discrepancy  between  real  models  with  the  models  used  for 
computation.  Such  discrepancies  can  lead  to  deviations  from 
the  desired  plan.  Moreover,  changes  in  the  environment  and 
large  disturbances  are  other  important  challenges  that  needs 
to  be  handled.  One  strategy  to  address  this  problem  is  an 
ability  to  dynamically  replan  in  belief  space.  In  this  paper, 
we  propose  a  principled  rollout-based  extension  of  FIRM 
planning  to  facilitate  its  application  to  real-time  stochastic 
(re)planning  problems,  and  deal  with  changes  in  the  envi¬ 
ronment  and  large  disturbances  in  the  robot’s  state. 

In  the  main  body  of  POMDP  literature,  in  particular 
sampling-based  methods,  the  computed  solution  depends  on 
the  initial  belief  [4],  [7],  [9],  [14],  [21],  [28]  (sometimes 
referred  to  as  single-query  solvers).  Therefore,  in  replanning 
(planning  from  a  new  initial  belief)  almost  all  the  compu¬ 
tations  need  to  be  done  again,  which  prohibits  their  usage 
in  cases  where  real-time  dynamic  replanning  schemes,  such 
as  Receding  Horizon  Control  (RHC),  are  needed.  However, 
multi-query  methods  such  as  FIRM  [2],  [3]  provide  a  con¬ 
struction  mechanism  independent  of  the  initial  belief  of  the 
system.  As  a  result,  they  are  suitable  methods  to  be  used  for 
dynamic  replanning  purposes. 

Trajectory  optimization-based  methods  can  also  be  used 
for  replanning  in  an  RHC  scheme.  The  RHC  framework  was 
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originally  designed  for  deterministic  systems.  The  most  com¬ 
mon  approach  is  to  approximate  the  stochastic  system  with  a 
deterministic  one  by  replacing  the  uncertain  quantities  with 
their  mean  (or  maximum  likelihood)  values  [5].  Methods 
such  as  [8],  [10],  [12],  [20],  [27]  fall  into  this  category  and 
can  be  used  in  the  RHC  setting;  they  replace  future  random 
observations  with  their  deterministic  maximum  likelihood 
value.  However,  in  this  form  of  RHC,  the  optimization  is 
carried  out  only  within  a  limited  horizon.  Also,  removing 
the  system’s  stochasticity  may  lead  to  unreliable  plans. 

The  main  contributions  of  this  paper  are  threefold. 

•  We  propose  a  principled  method  for  real-time  replan¬ 
ning  in  belief  space  by  extending  the  idea  of  the  rollout 
policy  [5]  to  belief  space  using  FIRM.  This  method 
considers  all  possible  future  observations. 

•  We  propose  techniques  such  as  a  “lazy  feedback  evalu¬ 
ation”  algorithm  to  react  to  changes  in  the  environment 
as  well  as  large  disturbances. 

•  We  implement  the  proposed  belief  space  planning 
scheme  on  a  physical  robotic  system  as  an  application 
of  the  FIRM  framework.  We  demonstrate  the  robustness 
of  the  method  to  changes  in  the  environment,  failures 
in  the  sensory  system,  and  large  deviations. 

These  results  lay  the  groundwork  for  further  application  of 
the  theoretical  POMDP  framework  to  practical  applications, 
thus  moving  toward  long-term  autonomy  in  robotic  systems. 

II.  Problem  Statement  and  Target  Application 

We  aim  to  design  a  belief  space  planner  that  can  han¬ 
dle  uncertainties  associated  with  a  typical  low-cost  robot. 
Moreover,  the  planner  needs  to  be  able  to  replan  in  real¬ 
time  so  that  it  can  cope  with  changes  in  the  environment  as 
well  as  deviations  resulting  from  model  discrepancies,  large 
disturbances,  and  sensor  failures. 

To  formally  define  the  problem,  we  start  by  defining  the 
concept  of  belief  and  policy.  Consider  a  system  whose  state, 
control,  and  motion  noise  are  denoted  by  Xk,  Uk,  and  Wk, 
respectively,  at  the  fc-th  time  step.  Let  us  denote  the  state 
evolution  model  by  Xk+i  =  f(xk,Uk,Wk)-  In  a  partially 
observable  environment,  the  exact  value  of  system  state  Xk 
is  not  known.  However,  we  can  get  the  measurement  (or 
observation)  vector  Zk  at  the  fc-th  time  step  through  sensors. 
Let  us  denote  the  measurement  model  by  Zk  =  h(xk,Vk), 
where  Vk  denotes  sensing  noise.  Therefore,  the  only  available 
data  for  decision  making  at  the  fc-th  time  step  are  the 
observations  we  have  received  and  the  controls  we  have 
applied  up  to  that  time  step,  i.e.,  TLk  =  ^0:/c-i}  = 

{z0,  z i,  •  •  •  ,  Zk,  •  •  •  ,  Uk- 1}.  A  filtering  module  can  en¬ 
code  this  data  into  a  probability  distribution  over  all  possible 
system  states  bk  =  p(xk\TLk),  which  is  referred  to  as  the 
belief  or  information- state.  Therefore,  the  action  Uk  can  be 
taken  based  on  the  belief  bk  using  a  policy  (planner)  i T&,  i.e., 
uk  =  7 Tk(bk).  In  Bayesian  filtering,  belief  can  be  computed 
recursively  based  on  the  last  action  and  current  observation, 
bk+i  =  T(bk,uk,zk+ 1)  [5],  [26], 

To  find  the  policy  iik,  we  need  to  define  the  objective  of 
planning.  Although  the  objective  function  can  be  general, 


the  cost  function  we  will  use  in  our  experiments  includes 
the  localization  uncertainty,  control  effort,  and  elapsed  time. 

c(bk,uk )  =  CPtr{Pk)  +  Cu|KII  +Ct,  (1) 

where  tr (P&)  is  the  trace  of  estimation  covariance.  The  norm 
of  the  control  signal  \\uk\\  denotes  the  control  effort,  and  £ t 
is  present  in  the  cost  to  penalize  each  time  lapse.  Coefficients 
CP,  Cu ,  and  ( t  are  user-defined  task-dependent  scalars  that 
combine  these  costs  to  achieve  a  desirable  behaviour.  In  the 
presence  of  a  constraint  set  F  (e.g.,  obstacles),  we  assume 
that  the  task  fails  if  the  robot  violates  these  constraints  (e.g., 
collides  with  obstacles).  Therefore,  in  case  of  failure,  the 
running-sum  of  costs  (cost-to-go),  i.e.,  J(F)  =  Ylt?  c(b,u) 
is  set  to  a  suitably  high  cost-to-go. 

Planning  under  uncertainty  is  defined  as  finding  a  sequence 
of  policies  7ro:0o(*)  =  {tti(-),  *  ’ '  }•  Therefore,  the 

original  problem  of  stochastic  control  with  imperfect  state 
information  is  defined  as  follows: 

Problem  1.  (POMDP)  The  problem  of  stochastic  control 
with  imperfect  state  information,  or  the  Partially -Observable 
Markov  Decision  Process  (POMDP)  problem,  is  defined  as 
the  following  optimization  over  the  policy  space: 

oo 

7T0:oo(-)  =  arg  min  [c(bk,nk{bk))]  (2) 

n°;“  k= o 

s.t.  bk+i  =  T(bk,TTk(bk),Zk),  zk  ~  p(zk\xk) 

Xk+ 1  =  f(xk,  IT k(bk),wk ),  wk  ~  p(wk \xk,  irk(bk )) 

where,  is  the  space  of  all  possible  policies  at  time  step 
k,  i.e.,  t rk  e  Uk. 

In  the  infinite  horizon  case,  the  solution  is  a  stationary 
policy  7rs,  i.e.,  iri  =  112  =  •  •  •  =  tts.  However,  Problem  1  is 
written  in  a  more  general  setting  to  emphasize  the  connection 
with  rollout  policy,  discussed  further  below.  Solving  the 
POMDP  problem  is  computationally  intractable  over  con¬ 
tinuous  state,  action,  and  observation  spaces.  However,  the 
main  problem  that  this  paper  aims  to  solve  is  the  following: 

Problem  2.  (Re-Solving  POMDPs  in  real-time)  In  case  of  a 
change  in  the  failure  set  F  (e.g.  obstacles)  or  large  deviation 
in  the  system's  belief,  re-solve  Problem  1  in  real-time. 

This  paper  aims  at  solving  Problem  2  by  exploiting  FIRM, 
which  re-use  the  computations  performed  for  solving  the 
POMDP  problem  a  priori  and  hence  can  deal  with  such 
changes  online. 

A.  Sample  Application  Scenario 

We  exercise  the  proposed  planner  in  an  office-like  environ¬ 
ment,  where  we  use  a  low-cost  iRobot  Create  platform  (Fig¬ 
ure  1(a)),  on  which  a  Dell  Latitude  laptop  with  an  on-board 
camera  is  mounted.  The  robot  obtains  noisy  measurements 
(relative  range  and  bearing)  from  unique  landmarks  that  are 
installed  in  the  environment.  The  desired  behaviour  for  the 
planner  is  to  guide  the  robot  to  a  goal  through  those  regions 
of  the  environment  where  the  robot  can  better  localize  itself 
and  hence  better  avoid  collisions.  Most  importantly,  the 


planner  needs  to  be  able  to  replan  online  so  that  it  can  handle 
changes  in  the  environment  and  deviations  resulting  from 
model  discrepancies,  large  disturbances,  and  sensor  failures. 
We  briefly  discuss  the  environment,  robot  model,  and  sensory 
system.  More  detailed  descriptions  can  be  found  in  [1]. 

Environment:  The  specific  environment  for  conducting 
experiments  is  the  fourth  floor  of  the  Bright  building  at 
Texas  A&M  University.  A  floorplan  is  shown  in  Fig.  1(b). 
The  hallway  (yellow)  and  the  experiment  region  (blue)  are 
highlighted.  The  blue  region  contains  a  large  cluttered  office 
(room  407)  with  several  doors. 

System  model  (robot  and  sensors):  We  use  an  iRobot 
Create  (Fig.  1(a)),  whose  state  Xk  =  {*k,yk,0k)T  encodes 
its  2D  position  and  heading  angle  at  the  k- th  time  step. 
The  state  evolution  model  Xk+i  =  f(xk^Uk^Wk)  is  the 
unicycle  model,  where  the  control  command  u &  consists  of 
the  linear  and  angular  velocities  Uk  =  (V/C,cc/C)T .  Motion 
noise  w &  ~  A/"(0,  Q k)  gets  added  to  the  control  signal  (see 
[1]  for  details).  For  sensing  purposes,  we  use  the  laptop’s 
on-board  camera  to  detect  landmarks  (with  unique  black  and 
white  patterns)  that  are  placed  at  known  locations  on  walls 
(Fig.  1(a)).  Denoting  the  j- th  landmark  position  as  JL,  the 
obtained  measurement  is  the  relative  range  and  bearing  to 
the  landmark: 

3zk  =  [\\3dk\\,atan2(3d2k,Jdlk)  -  0}T  +  Jv,  3v~Af( 0,JR), 

where  jdk  =  [jdik,jd2k]T  :=  [xk,yk]T  ~  Lj.  Experimen- 
tally,  we  have  found  that  the  intensity  of  measurement  noise 
iy  increases  with  the  distance  from  the  j-th  landmark  and 
the  incidence  angle.  The  incidence  angle  refers  to  the  angle 
between  the  line  connecting  the  camera  to  a  landmark  and 
a  surface  normal  to  the  wall  on  which  the  landmark  is 
mounted.  Denoting  the  incident  angle  by  0  G  [— 7r/2,  7t/2], 
we  model  the  sensing  noise  associated  with  the  j-th  landmark 
as  a  zero  mean  Gaussian  whose  covariance  is 

JRk  =  diag((rjrdfdk\\+VrJ(pk\  +crl)2, 

(Ved\\jdk\\  +J?0j<£fe|  +of)2)  (3) 

In  our  implementation,  we  use  r\Td  =  0.1,  7yr0  =  0.01, 
(jrb  =  0.05m,  r]od  =  0.001,  tjq^  =  0.01,  and  ab  =  2.0deg. 
The  full  vector  of  measurements  2  is  the  concatenation  of 
measurements  from  visible  landmarks. 

III.  Overview  of  FIRM 

In  this  section,  we  briefly  review  the  Feedback-based 
Information  RoadMap  (FIRM)  framework  [2],  [3].  However, 
the  concrete  realization  of  FIRM  constructed  for  conducting 
the  experiments  is  detailed  in  [1].  An  Information  RoadMap 
(IRM)  is  a  “multi-query”  graph  in  belief  space  constructed 
independent  of  the  initial  belief  space.  Therefore,  the  in¬ 
tractable  belief  MDP  problem  can  be  reduced  to  a  tractable 
MDP  problem  on  this  graph.  Each  node  in  an  IRM  is  a 
small  region  B  =  {b  :  ||5  —  b\\  <  e}  around  a  sampled  belief 
b.  We  denote  the  i-th  node  by  Bl  and  the  set  of  nodes  by 
V  =  {B1}.  Each  edge  in  an  IRM  is  a  local  controller.  In 
FIRM,  each  edge  (local  controller)  is  a  feedback  controller 
whose  goal  is  to  drive  the  belief  into  the  target  node  of  the 


edge.  We  denote  the  edge  (controller)  between  nodes  i  and 
j  by  /iu  and  the  set  of  edges  by  M  =  {/A7 } .  A  policy  1 t9 
on  the  graph  is  a  mapping  from  graph  nodes  to  edges;  i.e., 
7T5,  :  V  — >  M.  Denote  the  set  of  all  possible  policies  as  n^. 

Having  such  a  graph  in  belief  space,  we  can  form  the 
POMDP  on  the  FIRM  graph  (so-called  FIRM  MDP): 

00 

7 t9  =  arg  min  E  Y,C9(Bn,ir°(Bn))  (4) 

n= 0 

where,  Bn  is  the  n-th  visited  node,  and  fin  is  the  edge  taken 
at  Bn.  C9(B1/jJ)  :=  Y^=oc(bk,  l^{bk))  is  the  generalized 
cost  of  taking  local  controller  fi  at  node  B  centered  at  bo. 

We  incorporate  the  failure  set  in  planning  by  adding  a 
hypothetical  FIRM  node  B°  =  F  to  the  list  of  FIRM  nodes. 
As  the  FIRM  MDP  in  Eq.(4)  is  defined  over  the  finite  set 
of  nodes,  we  can  solve  it  by  computing  the  graph  cost-to-go 
through  solving  the  following  dynamic  programming: 

N 

J9(Bi)=Ymn{C9(B\/i)  +  V  P 9 (B^\B\  ii)J9 (B^)}  (5) 

[i  z ' 

7=0 

where  F9(B~/\B'1,  fi)  is  the  probability  of  reaching  B 7  from 
Bl  under  fi.  The  failure  and  goal  cost-to-go’s  (i.e.,  J9(B° ) 
and  J9(B9°al))  are  set  to  a  suitably  high  positive  value  and 
zero,  respectively.  Accordingly,  the  replanning  algorithms, 
when  start  or  goal  changes,  are  presented  in  Algorithms  1 
and  2.  For  a  more  detailed  description  of  FIRM,  see  [1]. 

Algorithm  1:  (Re)plan_from 

1  input  :  Start  belief  bo,  cost-to-go  J9( •),  nodes  Y={B l} 

2  output  :  Next  Local  Controller  /a* 

3  Find  r  neighboring  nodes  9T  =  {Bl}ri=1  to  bo; 

4  Set  J*(B)  =00; 

5  for  B  G  do 

6  Construct  local  planner  fi  from  bo  to  B; 

7  Compute  transition  cost  C(6o,a0  and  probability 

p(b|&o,m); 

N 

8  if  C(&o,aO  +  E  P0B7|&o,At)Js0B7)  <  J*{B)  then 

7=0 

9  j*(b)  =  c(bo,ii)  +E^=0p(£7IWVs(£7); 

10  fi*  - 11; 

11  return  fB; 

Algorithm  2:  (Re)plan_to 

1  input  :  Goal  node  B9°al ,  FIRM  Graph  Q  =  {V,  M} 

2  output  :  FIRM  feedback  7 t9 

3  Add  B9°al  to  the  graph;  update  V  and  M  accordingly; 

4  Compute  the  cost-to-go  J9  and  feedback  it9  over  the 
FIRM  nodes  by  solving  the  MDP  in  Eq.  (5); 

5  return  it9; 


IV.  Dynamic  Replanning  in  Belief  Space 

In  this  section,  we  first  discuss  the  extension  of  the 
Receding  Horizon  Control  (RHC)  and  Rollout  Policy  (ROP) 
[5]  to  belief  space.  Then  we  propose  an  ROP  based  on  FIRM 


that  can  cope  with  changes  in  the  environment  as  well  as 
large  deviations. 

RHC  in  belief  space:  Receding  horizon  control  (often 
referred  to  as  rolling-horizon  or  model-predictive  control) 
was  originally  designed  for  deterministic  systems  (to  cope 
with  model  discrepancy).  For  stochastic  systems,  where  the 
closed-loop  (feedback)  control  law  is  needed,  the  best  for¬ 
mulation  of  the  RHC  scheme  is  a  subject  of  current  research 
[8],  [16],  [22].  In  the  most  common  form  of  RHC  [5], 
the  stochastic  system  is  approximated  with  a  deterministic 
system  by  replacing  the  uncertain  quantities  with  their  typical 
values  (e.g.,  maximum  likelihood  value).  In  belief  space 
planning,  the  quantities  that  inject  randomness  into  belief 
dynamics  are  unknown  future  observations.  Thus,  one  can  re¬ 
place  random  observations  with  their  deterministic  maxi¬ 
mum  likelihood  value  z™1 ,  where  z™1  :=  argmax2p(z/c|x^) 
in  which  xd  is  the  nominal  deterministic  value  for  the  state 
that  results  from  replacing  the  motion  noise  w  by  zero; 
i.e.,  xd+1  =  f(xd,7Tk{bd)iO).  The  deterministic  belief  bd 
is  then  used  for  planning  in  the  receding  horizon  window. 
At  every  time  step,  the  RHC  scheme  performs  a  two- stage 
computation.  To  describe  these  stages,  let  us  assume  we  are 
at  step  n  and  the  belief  is  bn.  At  the  first  stage,  the  RHC 
scheme  for  deterministic  systems  solves  an  open-loop  control 
problem  (i.e.,  returns  a  sequence  of  actions  uo-.t)  over  a 
fixed  finite  horizon  T  by  solving  the  following  optimization 
problem:  T 

u0:T  =  argmin  V'c(6fe,u/C) 

Uo:T  fe= o 

S.t.  bdk+l=T(bdk,uk,z£l1),  b$  =  bn 

zk+i  =  argmaxp(2:|x^+1) 

xk+i  =  f(xk,uk,0),  (6) 

In  the  second  stage,  it  executes  only  the  first  action  uq 
and  discards  the  remaining  actions  in  the  sequence  uo-.t- 
However,  since  the  actual  observation  is  noisy  and  is  not 
equal  to  the  zml ,  belief  6n+i  will  be  different  than  bd. 
Subsequently,  RHC  performs  these  two  computations  from 
the  new  belief  bn+ 1.  In  other  words,  RHC  computes  an 
open  loop  sequence  uo-.t  from  this  new  belief.  This  process 
continues  until  the  belief  reaches  a  desired  belief  location. 
Algorithm  3  recaps  this  procedure. 


Algorithm  3:  RHC  for  Partially-observable  stochastic 
systems 

1  input  :  Initial  belief  bcurrent  G  X,  Bgoai  C  ® 

2  while  bcurrenf  ^  Bg0ai  do 

3  uo:t  =  Solve  the  optimization  in  Eq.(6)  starting 
from  b$  =  bcurrent, 

4  Apply  the  action  uo  to  the  system; 

5  Observe  the  actual  z; 

6  Compute  the  belief  bcurrent  G-  r(bcurrenUuo,  z); 


State-of-the-art  methods  such  as  [27]  and  [19]  utilize  this 
form  of  RHC  in  belief  space.  This  framework  is  also  called 


Partially-Closed  Loop  RHC  (PCLRHC)  [27]  since  it  partially 
exploits  some  information  about  the  future  observations  (i.e., 
zml )  and  does  not  fully  ignore  them. 

Issues  with  RHC:  There  are  some  issues  regarding  the 
presented  form  of  the  RHC  framework:  First,  due  to  the  lim¬ 
ited  horizon  and  ignoring  the  cost-to-go  beyond  the  horizon, 
the  method  may  get  stuck  into  pitfalls  by  choosing  actions 
that  guide  the  robot  toward  “favorable”  states  (with  low  cost) 
in  the  near  future  followed  by  a  set  of  “unfavorable”  states 
(with  a  high  cost)  in  the  long  run.  Second,  the  presented  form 
of  RHC  ignores  the  stochasticity  of  the  system  within  the 
horizon,  which  may  lead  to  inaccurate  approximations  of  the 
cost  and  unreliable  control  actions.  To  overcome  these  issues, 
researchers  have  proposed  variants  of  RHC  and  different 
frameworks  based  on  the  idea  of  repeated  planning  [5].  Here, 
we  discuss  such  a  framework  called  “rollout  policy”  [5]  and 
aim  to  realize  it  in  belief  space  using  the  FIRM  framework. 

Rollout  policy  in  belief  space:  A  class  of  methods  that 
aims  to  reduce  the  complexity  of  the  stochastic  planning 
problem  in  Eq.2  is  the  class  of  Rollout  Policies  (ROP)  [5], 
which  are  more  powerful  than  the  described  version  of  RHC 
in  the  following  sense:  First,  they  search  for  a  sequence  of 
policies  (instead  of  open-loop  controls)  within  the  horizon, 
and  do  not  approximate  the  system  with  a  deterministic 
one.  Second,  they  use  a  suboptimal  policy,  called  the  “base 
policy,”  to  compute  a  cost-to-go  function  J  that  approximates 
the  true  cost-to-go  beyond  the  horizon.  In  other  words,  at 
each  step  of  the  rollout  policy  scheme,  the  following  closed- 
loop  optimization  is  solved: 


7Tq  ;t(')  =  argminE 


r  c(t>k,  Kk(bk))  +  J {bT+ 1) 


,ft= o 


(7) 


s.t.  bk+1  =  T(bk,TTk(bk),zk),  zk~p(zk \xk) 

xk+i  =  f(xk,nk(bk),wk),  wk  ~  p(wk\xk,irk(bk)) 


Then,  only  the  first  control  law  7To  is  used  to  generate  the 
control  signal  uq  and  the  rest  of  the  policies  are  discarded. 
Similar  to  RHC,  after  applying  the  first  control,  a  new 
sequence  of  policies  is  computed  from  the  new  point.  The 
rollout  algorithm  is  shown  in  Algorithm  4. 

Algorithm  4:  Rollout  algorithm  in  Belief  Space: 

1  input  :  Initial  belief  bcurrent  G  B,  Bgoai  C  B 

2  while  ^current  ^  Bgoai  do 

3  7To :T  =  Solve  optimization  in  Eq.(7)  starting  from 

bo  —  ^current, 

4  Apply  the  action  uo  =  tt(^o)  to  the  system; 

5  Observe  the  actual  z; 

6  Compute  the  belief  bcurrent  <-  r(bcurrenUuo,  z); 


Although  the  rollout  policy  in  the  belief  space  efficiently 
reduces  the  computational  cost  compared  to  the  original 
POMDP  problem,  it  is  still  formidable  to  solve,  since  the 
optimization  is  carried  out  over  the  policy  space.  Moreover, 
there  should  be  a  base  policy  that  provides  a  reasonable  cost- 
to-go  J.  We  propose  a  rollout  policy  in  the  belief  space  based 
on  the  FIRM-based  cost-to-go. 


FIRM-based  Rollout  Policy:  In  the  FIRM-based  rollout 
policy,  we  adopt  the  FIRM  policy  as  the  base  policy  of  the 
rollout  algorithm.  Accordingly,  the  cost-to-go  of  the  FIRM 
policy  will  be  used  as  the  cost-to-go  beyond  the  horizon. 
Now,  if  we  have  a  dense  FIRM  graph  such  that  FIRM  nodes 
partition  the  belief  space  (i.e.,  U =  B),  then  at  the  end 
of  the  horizon,  the  belief  6t+i  belongs  to  a  FIRM  node 
B  from  which  the  FIRM  cost-to-go  is  available.  However, 
in  practice,  when  the  FIRM  nodes  cannot  cover  the  entire 
belief  space,  we  need  to  make  sure  that  a  truncated  policy 
can  drive  the  belief  into  a  FIRM  node  at  the  end  of  horizon. 
Nevertheless,  since  the  belief  evolution  is  random,  we  may 
not  be  able  to  guarantee  that  the  belief  reaches  a  FIRM  node 
at  the  end  of  a  deterministic  horizon  T.  Therefore,  instead 
of  truncating  the  policy  over  a  fixed  time,  we  truncate  the 
policy  once  the  belief  reaches  a  pre- specified  stopping  region 
(which  happens  in  a  random  time  denoted  by  T)  as  follows: 

r  r  _ 

7T0:oo(-)  =  arg  min  E  S'c(bk,Ttk(bk))  +  J(br+ 1) 

n°:~  L=o  J 

s.t.  bk+1  =  T(bk,TTk(bk),Zk),  Zk~p(zk \xk) 

xk+l  =  f(xk,TTk(bk),Wk),  Wk  ~  p(wk\xk,TTk(bk)) 
br+1  €  u jB\  (8) 

where  for  67-+ 1  G  B J  we  have 

J(bT+ 1)  =  J9(Bj)  (9) 

The  last  condition  in  Eq.8  can  be  written  more  rigorously  as 
P(67-+iG  U jBi  1 7r )  =  1  for  a  finite  T.  Also,  as  noted  in  Eq.(9), 
it  is  worth  noting  that  the  FIRM-based  cost-to-go  J9{-)  plays 
the  role  of  the  cost-to-go  beyond  the  horizon  J(-). 

Therefore,  in  solving  the  FIRM-based  rollout  policy  prob¬ 
lem,  we  aim  to  find  a  sequence  of  policies  that  ends  up  in 
a  FIRM  node  and  minimizes  the  cost  in  Eq.8.  To  find  this 
optimal  policy,  we  parametrize  the  policy  space  and  perform 
minimization  over  the  parameter  space. 

In  our  implementation,  we  adopt  a  variant  of  the  Open- 
Loop  Feedback  Control  (OLFC)  scheme  [5]  along  with  a 
Kalman  Filter  as  the  belief  controller.  In  this  variant  of 
OLFC,  for  a  given  v,  we  compute  an  open-loop  control 
sequence  starting  from  the  current  estimation  mean  and 
ending  at  v.  Then,  we  apply  a  truncated  sequence  of  the 
first  l  controls  (/  =  5  in  our  experiments).  This  process 
repeats  every  l  steps  until  we  reach  the  graph  node.  More 
details  can  be  found  in  [1].  Therefore,  the  policy  can  be 
characterized  by  the  next  node;  i.e.,  7 r(-;v).  Thus,  to  solve 
the  optimization  in  Eq.7  we  search  for  the  FIRM  node  = 
(v-i,  pi)  whose  mean,  i.e.,  vJ,  leads  to  the  best  local  policy 
7r(-;  vj).  Accordingly,  we  implement  the  rollout  technique  in 
Algorithm  4. 

V.  Replanning  in  Changing  Environments  and 
Presence  of  Large  Deviations 

In  this  section,  we  discuss  how  we  handle  changes  in 
the  obstacle  map  and  large  deviations  in  the  robot’s  belief. 
In  general,  handling  these  cases  in  belief  space  is  a  big 
challenge  as  they  require  online  updating  of  the  planning 


structure  in  belief  space.  It  is  important  to  note  that  it  is 
the  graph  structure  of  FIRM  that  makes  such  an  update  and 
replanning  feasible  in  real-time.  The  graph  structure  of  FIRM 
allows  us  to  locally  change  collision  probabilities  without 
affecting  the  rest  of  the  graph  (i.e.,  properties  of  different 
edges  on  the  graph  are  independent  of  each  other).  It  is 
important  to  note  that  such  a  property  is  not  present  in  other 
state-of-the-art  belief  space  planners,  including  SARSOP 
[15],  BRM  (Belief  Roadmap  Method)  [21],  or  LQG-MP  [28]. 
In  those  methods,  collision  probabilities  and  costs  on  all 
edges  (number  of  possible  edges  is  exponential  in  the  size 
of  underlying  PRM)  need  to  be  re-computed. 

A.  Lazy  Feedback  Evaluation  in  Changing  Environments 

To  adapt  the  proposed  framework  to  handle  changing 
environments,  we  rely  on  lazy  evaluation  methods.  Inspired 
by  the  lazy  evaluation  methods  for  PRM  frameworks  [6], 
we  propose  a  variant  of  the  lazy  evaluation  methods  for 
evaluating  the  generated  feedback  law.  The  basic  idea  is  that 
at  every  node  the  robot  re-evaluates  only  the  next  edge  that  it 
needs  to  take  or  a  limited  set  of  edges  in  the  vicinity  of  the 
robot.  By  re-evaluation,  we  mean  it  re-computes  collision 
probabilities  along  those  edges.  If  there  is  a  significant 
change  in  the  local  collision  probabilities,  then  the  dynamic 
programming  problem  is  re-solved  and  a  new  feedback  tree  is 
computed.  Otherwise,  the  feedback  tree  remains  unchanged 
and  the  robot  keeps  following  it.  This  lazy  evaluation  scheme 
can  be  performed  in  real-time.  The  method  is  outlined  in 
Algorithm  5. 

Algorithm  5:  Lazy  Feedback  Re-Evaluation 

1  input  :  Feedback  tt9,  current  belief  bcurrent 

2  output  :  Updated  feedback  tt9 

3  Update  the  obstacles  map; 

4  if  there  is  a  change  in  map  then 

5  Wg  Retrieve  the  sequence  of  nominal  edges 
returned  by  feedback  up  to  horizon  /; 

6  forall  the  edges  p  G  W  do 

7  Re-compute  the  collision  probabilities 

L  Bnew(L>,  p)  from  the  start  node  B  of  edge; 

8  if  exists  p  G  W  such  that 

|P new(B,p)  -¥(B,p)\  >  a  then 

9  P  •P3 new 

10  |_  tt9  <-  Replan (bcurrent); 

11  return  ir9\ 

B.  Handling  Large  Disturbances  ( kidnapped  robot  problem ) 

In  robotics,  the  kidnapped  robot  problem  commonly  refers 
to  a  situation  where  an  autonomous  robot  in  operation  is 
carried  to  an  arbitrary  location.  This  problem  introduces 
different  challenges  such  as  (i)  how  to  detect  kidnapping, 
(ii)  how  to  localize  the  robot,  and  (iii)  how  to  control  the 
robot  to  recover  from  this  situation  and  accomplish  its  goal. 
The  third  part  of  this  problem  calls  for  online  replanning  in 
belief  space. 


Detecting  a  kidnapped  situation :  To  detect  the  kidnapped 
situation,  we  constantly  monitor  the  innovation  signal  Zk  = 
Zk—Zfr  (the  difference  between  actual  and  predicted  observa¬ 
tions).  Recall  that  in  our  setting  the  observation  at  time  step 
k  from  the  j-th  landmark  is  the  relative  range  and  bearing 
of  the  robot  to  the  j- th  landmark,  i.e.,  i  Zk  =  (frk^Ok)- 
The  predicted  version  of  this  measurement  is  shown  by 
We  monitor  the  following  measures  of 
the  innovation  signal: 

rk  =  max(|Jrfc  6k  =  max.{de  Ok,1 6^)),  (10) 

3  3 

where  de{0,6')  returns  the  absolute  value  of  the  smallest 
angle  that  maps  6  onto  O'.  Passing  these  signals  through  a 
low-pass  filter,  we  filter  out  the  outliers  (temporary  failures  in 
the  sensory  reading).  Denoting  the  filtered  signals  by  r&  and 
Ok,  we  monitor  the  conditions  fk  <  rmax  and  0^  <  0max.  If 
both  of  them  are  satisfied,  we  follow  the  FIRM  feedback  (i.e., 
we  are  in  the  Feedback  Following  Mode  (FFM)).  However, 
violation  of  either  of  these  conditions  means  that  the  robot 
is  constantly  observing  high  innovations,  and  thus  it  is  not  in 
the  location  that  it  was  supposed  to  be  (i.e.,  it  is  kidnapped). 
In  Section  VI,  we  show  the  innovation  signal  for  a  sample 
run  on  a  physical  robot.  In  our  implementation,  we  consider 
rmax  =  1  (meters)  and  0max  =  50  (degrees). 

Information  Gathering  Mode  (IGM):  Once  the  robot  de¬ 
tects  it  has  been  kidnapped,  the  estimation  covariance  is 
replaced  with  a  large  covariance  to  get  an  approximately 
uniform  distribution  over  the  state  space.  Then,  we  enter 
the  Information  Gathering  Mode  (IGM),  where  we  take 
small  and  conservative  steps  (e.g.,  turning  in  place  or  tak¬ 
ing  random  actions  with  small  velocities)  to  obtain  more 
measurements.  Once  the  robot  gets  these  measurements,  the 
localization  module  corrects  the  estimation  value  and  the 
innovation  signal  reduces.  When  conditions  fk  <  rma;E  and 
Ok  <  Ornax  are  satisfied  again,  we  exit  the  information 
gathering  mode. 

Post- IGM  replanning:  After  recovering  from  being  kid¬ 
napped,  controlling  the  robot  in  belief  space  remains  a 
significant  challenge  because  the  system  can  be  far  from 
where  it  was  expected  to  be.  However,  using  the  proposed 
method  and  assuming  the  FIRM  graph  has  enough  nodes 
distributed  well  in  the  space,  the  robot  needs  to  go  only  to 
a  neighboring  node  from  this  new  point.  Therefore,  there  is 
no  need  for  a  costly  replanning  procedure.  Indeed,  the  only 
required  computation  is  to  evaluate  the  cost  of  edges  that 
connect  the  new  start  point  to  the  neighboring  FIRM  nodes 
based  on  Algorithm  1. 

VI.  Experimental  Results 

In  this  section,  we  first  discuss  the  results  of  PRM  and 
FIRM-based  motion  planning  and  show  how  belief  space 
planning  can  improve  the  performance.  Then,  we  distinguish 
our  method  from  the  state-of-the-art  by  examining  and 
discussing  the  robustness  properties  of  the  proposed  method 
to  changes  in  the  obstacle  map,  and  to  large  deviations  in 
the  robot’s  location  and  the  goal  location.  The  experiments 


are  conducted  on  a  low-cost  iRobot  Create  equipped  with  a 
laptop  and  an  integrated  monocular  web-camera  (Fig.  1(a)). 

A.  Planning  with  PRM  and  FIRM 

The  goal  of  this  section  is  to  compare  the  performance  of 
FIRM  with  deterministic  planners  such  as  Medial  Axis  PRM 
(MAPRM)  [29].  The  solution  of  the  dynamic  programming 
problem,  i.e.,  i t9,  is  visualized  with  a  feedback  tree  (FT). 
For  each  node,  FT  contains  only  one  outgoing  edge  (p  = 
7 t9(B1)).  FT  is  rooted  at  the  goal  node. 

MAPRM-based  planning:  As  one  of  the  best  variants  of 
PRM  when  it  comes  to  collision  avoidance,  we  construct 
an  MAPRM  [29]  in  the  environment  (Fig.  2(a)).  As  is  seen 
in  Fig.  2(a),  the  path  with  maximum  obstacle  clearance 
(and  the  shortest  path)  is  the  one  through  the  front  door  of 
room  407  (see  Fig.  1(b)).  Therefore,  based  on  the  obstacle 
clearance,  MAPRM  leads  to  the  feedback  tree  shown  in 
Fig.  2(b)  that  guides  the  robot  through  the  front  door.  To 
execute  the  MAPRM  plan  we  design  LQG  controllers  to 
track  the  computed  path.  However,  due  to  the  lack  of  enough 
information  along  the  solution  path,  the  success  rate  of  this 
plan  is  27%  (27  runs  out  of  100  Monte  Carlo  runs  were 
successful)  and  the  robot  frequently  collides  with  obstacles. 

FIRM-based  planning:  In  planning  with  FIRM,  the  in¬ 
formation  distribution  in  the  environment  is  encoded  in  the 
planning  via  a  framework  which  leads  to  a  better  judgement 
of  the  narrowness  of  passages  in  the  belief  space.  Although 
in  this  environment  the  path  through  the  front  door  is  shorter, 
the  success  probability  of  traversing  through  the  back  door  is 
more  due  to  the  presence  of  more  information  sources.  Such 
knowledge  about  the  environment  is  reflected  in  the  FIRM 
cost-to-go  and  success  probability.  As  a  result,  it  generates 
a  policy  that  suits  the  application,  taking  into  account  the 
uncertainty,  and  available  information  in  the  environment. 
Solving  DP  on  the  FIRM  graph  gives  the  feedback  shown  in 
Fig.  2(c),  which  results  in  an  88%  success  probability. 

B.  Robustness  to  Changes  in  the  obstacle  map 

In  this  section,  we  investigate  the  robustness  of  the 
proposed  algorithm  to  changes  in  obstacles  for  a  physical 
system.  In  our  experiments,  we  consider  two  types  of  obsta¬ 
cles.  The  first  set  of  obstacles  (most  of  the  map)  are  static 
obstacles  such  as  walls.  The  second  class  of  obstacles  include 
those  that  discretely  change  their  state  such  as  doors  (state 
changes  between  “open”  or  “closed”)  in  the  environment. 
As  discussed  earlier,  handling  such  changes  is  a  challenge  in 
state-of-the-art  belief  space  planners  since  the  planner  cannot 
be  updated  locally  and  all  computation  for  constructing  the 
planner  needs  to  be  reproduced,  which  is  not  a  feasible 
operation  in  real-time.  The  main  focus  of  the  following 
experiments  is  to  demonstrate  how  our  method  can  replan 
in  real-time  when  faced  with  changes  in  the  obstacle  map. 

We  consider  the  environment  shown  in  Fig.  1(b).  The  start 
and  goal  locations  are  shown  in  Fig.  3(a).  We  construct  a 
PRM  in  the  environment  ignoring  the  changing  obstacles 
(e.g.,  assuming  all  doors  are  open).  Leveraging  PRM  to 
construct  a  FIRM  and  solving  the  dynamic  programming 
problem  on  it,  we  get  the  feedback  tree  shown  in  Fig.  3(a) 


(a)  (b)  (c) 

Fig.  2.  (a)  The  environment  including  obstacles  (blue),  free  space  (black),  and  landmarks  (white  diamonds)  on  the  walls  are  shown.  An  MAPRM  graph 

approximating  the  connectivity  of  free  space,  starting  point,  and  goal  point  are  shown,  (b)  The  feedback  tree  generated  by  solving  DP  on  MAPRM  is 
shown  in  yellow.  From  each  node  there  is  only  one  outgoing  edge  (in  yellow),  computed  by  DP,  guiding  the  robot  toward  the  goal.  Arrows  in  pink  coarsely 
represent  the  direction  on  which  the  feedback  guides  the  robot,  (c)  The  feedback  tree  generated  by  solving  DP  on  FIRM;  As  is  seen,  the  computed  feedback 
guides  robots  through  more  informative  regions  that  leads  to  more  accurate  localization  and  less  collision  probabilities. 


that  guides  the  robot  toward  the  goal  through  the  back-door 
of  room  407.  However,  the  challenge  is  that  the  door  may 
be  closed  when  the  robot  reaches  it,  and  there  may  be  new 
obstacles  in  the  environment.  The  robot  needs  to  replan  in 
real-time  once  it  encounters  such  changes  in  the  environment. 
For  details  on  the  obstacle  detection  mechanism  see  [1]. 

Figure  3(b)  shows  a  snapshot  of  our  run  when  the  robot 
detects  the  door  is  in  a  different  situation  than  expected.  As 
a  result,  the  robot  updates  the  obstacle  map  as  can  be  seen 
in  Fig.  3(b),  in  which  the  door  is  closed.  Accordingly,  the 
robot  replans  in  belief  space  based  on  Algorithm  5.  Figure 
3(b)  shows  the  feedback  tree  resulting  from  replanning.  As 
seen,  the  new  feedback  guides  the  robot  through  the  front 
door,  since  it  detects  the  back  door  is  closed.  The  video  of 
a  long  run  (see  Section  VI-D)  provides  more  detail  on  this 
procedure.  Moreover,  this  video  shows  the  robustness  of  the 
method  to  temporary  failures  in  the  perception  system  (e.g., 
missing  landmarks  due  to  blockages,  blur,  etc.),  which  is 
discussed  more  in  [1]. 

C.  Robustness  to  large  deviations 

In  this  section,  we  investigate  the  robustness  of  the  pro¬ 
posed  framework  in  dealing  with  large  deviations  in  the 
robot’s  position.  As  a  more  general  form  of  this  problem, 
we  consider  the  kidnapped  robot  problem  as  discussed  in 
the  previous  section.  The  need  for  online  replanning  in  belief 
space  makes  this  problem  challenging. 

Figure  4(a)  shows  a  snapshot  of  a  run  that  involves  two 
kidnappings  and  illustrates  the  robustness  of  the  planning 
algorithm  to  the  kidnapping  situation.  The  start  and  goal 
positions  are  shown  in  Fig.  4(a).  The  feedback  tree  (shown 
in  yellow)  guides  the  robot  toward  the  goal  through  the 
front  door.  However,  before  reaching  the  goal  point  the 
robot  gets  kidnapped  in  the  hallway  (cf.  Fig.  4(a))  and 
placed  in  an  unknown  location  within  room  407  (cf.  Fig. 
4(a)).  The  first  jump  in  4(b)  shows  this  deviation.  Once 
the  robot  recovers  from  being  kidnapped  (i.e.,  when  both 
innovation  signals  in  Fig.  4(b)  fall  below  their  corresponding 
thresholds),  replanning  from  the  new  point  is  performed. 
Feedback  guides  the  robot  toward  the  goal  point  from  within 
room  407.  However,  again,  before  robot  reaches  the  goal 
point,  it  is  kidnapped  and  placed  in  an  unknown  location  (see 
Fig.  4(a)).  The  second  jump  in  the  innovation  signals  in  Fig. 
4(b)  corresponds  to  this  kidnapping.  Again,  replanning  from 


(b) 

Fig.  3.  (a)  The  back  door  is  open  at  this  snapshot.  The  feedback  guides  the 
robot  toward  goal  through  the  back  door,  (b)  The  back  door  is  closed  at  this 
snapshot.  Robot  detects  the  door  is  closed  and  updates  the  obstacle  map 
(adds  door).  Accordingly  robot  replans  and  computes  the  new  feedback. 
The  new  feedback  guides  the  robot  through  the  front  door, 
the  new  point,  the  robot  follows  the  feedback  and  reaches 
the  goal  point. 

D.  A  Longer  and  more  complex  experiment 

We  next  demonstrate  the  ability  of  the  system  to  perform 
long-term  tasks  in  a  complex  scenario  that  consists  of  visiting 
several  goals  (each  time  therobot  reaches  a  goal,  a  user 
submits  a  new  goal).  The  replanning  ability  allows  the  robot 
to  change  the  plan  online  in  belief  space  as  the  goal  location 
changes.  Moreover,  the  robot  frequently  encounters  changes 
in  the  obstacle  map  (open/closed  doors  and  new  obstacles 
in  the  environment)  as  well  as  missing  information  sources 
and  kidnapped  robot  situations.  Thus,  the  robot  frequently 
needs  to  perform  a  replanning  operation  in  belief  space  to 
deal  with  such  frequent  changes.  A  25  minute  video  of  this 
run  is  recorded  and  available  in  [17]  (a  shorter  version  has 
been  submitted  along  with  the  paper)  that  shows  the  robot’s 
performance  in  this  complex  scenario.  In  this  video,  the 
robot  faces  three  changes  in  the  goal  location,  three  changes 
in  the  door’s  state  (open/closed),  several  new  obstacles  in 
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Fig.  4.  (a)  The  set  up  for  the  experiment  containing  two  kidnapping,  (b)  Innovation  signals  ffc  and  9k  during  this  run.  When  both  of  the  signals  are  below 
specified  thresholds  rmaa;  and  Omax  (dashed  red  lines),  robot  follows  the  FIRM  feedback.  Otherwise,  the  system  enters  the  information  gathering  mode. 


the  environment,  three  kidnapping  situations,  and  numerous 
failures  of  the  sensory  systems  due  to  missing  landmarks, 
blur  in  image,  and  etc. 

VII.  Conclusion 

In  this  paper,  we  present  an  application  of  the  FIRM 
motion  planning  method  to  a  physical  robotic  system.  This 
paper  proposes  a  robust  method  for  belief  space  planning 
based  on  efficient  online  replanning.  Such  replanning  is  a  key 
ability  in  handling  discrepancies  between  real  world  mod¬ 
els  and  computational  models,  changes  in  the  environment 
and  obstacles,  large  deviations,  and  changes  in  information 
sources.  We  implemented  this  belief  space  planner  on  a 
physical  system  and  demonstrate  the  robustness  to  such 
discrepancies  that  occur  in  practice.  We  believe  this  work 
provides  an  important  step  toward  making  POMDP  methods 
applicable  to  real  world  robotic  systems. 
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Multi-agent  Generalized  Probablistic  RoadMaps  :  MAGPRM 

Sandip  Kumar  and  Suman  Chakravorty 


Abstract — In  this  paper,  the  generalized  motion  planning 
algorithm  (Generalized  PRM  :  GRPM  [1,  2,  3])  is  extended 
to  a  class  of  multi-agent  motion  planning  problem  in  presence 
of  process  uncertainty  and  stochastic  maps. 

The  proposed  algorithm  is  a  hierarchical  approach  towards 
constructing  a  passive  coordination  strategy  which  utilizes  an 
existing  multiple  traveling  salesman  problem  (MTSP)  solution 
methodology  in  conjunction  with  the  GPRM  framework  to  solve 
the  multi-agent  motion  planning  problem.  The  proposed  algo¬ 
rithm  is  generalized  to  tackle  multi-agent  problems  involving 
heterogeneous  agents.  The  algorithm  is  used  to  solve  multi¬ 
agent  motion  planning  problems  involving  2-dimensional  and 
3-dimensional  agents  in  stochastic  maps  with  uncertainty  in  the 
motion  model.  Results  indicate  that  the  algorithm  successfully 
solves  the  problem  under  uncertainty,  and  generates  a  solution 
having  high  probability  of  success.  It  also  demonstrates  that 
the  algorithm  is  scalable  in  terms  of  number  of  start  and  goal 
locations,  the  number  of  agents  and  their  dynamics. 

I.  Introduction 

The  motion  planning  problem  for  multiple  agents,  in  envi¬ 
ronments  with  obstacles  is  a  challenging  problem  in  robotics. 
The  challenge  arises  due  to  searching  for  a  solution  in  the 
joint  state  and  control  spaces  of  the  agents.  Introduction 
of  motion  model  uncertainty  increases  the  complexity  even 
further.  Finding  a  solution  to  the  coordination  problem  [4], 
and  hence,  developing  coordination  strategies  for  multi-agent 
systems  in  the  presence  of  uncertainties  has  been  a  challenge. 

The  multiple  agents  motion  planning  problem  in  the 
presence  of  uncertainty  and  coordination  strategies  has  been 
addressed  previously  in  the  literature.  Some  of  the  related 
work  is  mentioned  here.  In  [5],  motion  primitives  are  used 
to  handle  multiple  agent  motion  planning  problem.  In  [6], 
distributed  cooperative  strategies  for  a  group  of  robotic 
manipulators  was  proposed  using  neural  networks.  This 
work  accounted  for  control  input  uncertainties.  In  [7],  the 
multiagent  motion  planning  problem  is  setup  as  a  Markov 
decision  process  (MDP)  with  constraints  and  uncertainty, 
where  the  coupling  between  agents  only  occur  in  the  re¬ 
wards  and  constraints.  The  optimal  move  in  the  presence 
of  constraints  are  computed  using  optimization  algorithms, 
however,  the  dynamics  of  the  agents  is  not  considered.  In  [8], 
a  cooperative  control  technique  is  proposed  which  creates 
a  communication  graph.  Under  certain  assumptions,  the 
dynamics  of  non-holonomic  agents  are  modeled  as  kinematic 
point  robot  chains,  and  the  cooperative  laws  are  defined  on 
this  reduced  model  of  multi-agent  system  (MAS).  In  [9],  the 
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author  has  developed  a  decentralized  multiagent  RRT,  and  a 
merit-based  token  passing  coordination  strategy.  With  respect 
to  team  formation  strategies,  uncertainties  were  considered 
in  communications,  but  not  in  the  motion  model.  In  [10],  the 
authors  developed  a  non-cooperative  approach  for  collision 
detection  and  avoidance.  Genetic  algorithm  and  Monte  Carlo 
techniques  were  used  to  address  uncertainties  in  motion 
model.  In  [11],  an  online  receding  horizon  motion  planner  is 
developed  for  solving  the  decentralized  navigation  problem 
for  multiple  agents.  Artificial  potential  fields  and  sliding 
mode  control  techniques  are  used  to  handle  uncertainties  in 
the  motion  model. 

Some  other  work,  involving  multiple  agents  coordination 
without  uncertainty  in  the  system,  are  worth  mentioning 
here.  In  [12],  recent  work  related  to  the  multiple  agent 
motion  coordination  problem  is  summarized.  In  [13],  the 
distributed  path  consensus  (DPC)  algorithm  is  extended  to 
multiple  task  execution.  The  DPC  was  developed  to  address 
multiagent  motion  planning  problem  with  time  parameterized 
constraints  on  the  distances  between  a  pair  of  robots.  In 
this  work,  the  state  space  is  combined  with  a  task  graph, 
and  optimal  trajectories  for  the  agents  are  searched  in  this 
graph  using  a  heuristic  function.  In  [14],  the  motion  planning 
problem  for  multiple  agents  was  addressed  using  coordina¬ 
tion  graphs  and  decoupled  planning.  The  agent  works  on 
path  following  and  obstacle  avoidance.  In  [15],  the  authors 
developed  a  navigation  function  methodology  for  decen¬ 
tralized  navigation  which  leads  to  reduced  computational 
complexity  and  increased  robustness  to  agent  failures.  They 
deal  with  holonomic  agents.  In  [16],  the  authors  were  able 
to  decompose  global  payoff  function  for  multiple  agent 
scenario  into  local  terms  using  a  coordination  graph.  They 
used  reinforcement  learning  techniques,  known  as,  sparse 
cooperative  Q-learning,  and  used  edge-based  decomposition 
of  the  actions  in  the  coordination  graph.  In  [17],  coordination 
between  agents  is  developed  using  the  theory  of  collective 
intelligence  through  probability  collectives. 

The  work  mentioned  above  solves  the  coordination  prob¬ 
lem  for  multi-agents,  using  various  different  methodologies. 
However,  a  systematic  framework  to  incorporate  motion 
model  uncertainty,  non-trivial  non-linear  dynamics  of  robots, 
and  solve  for  high-dimensional  state-space  systems,  is  still 
missing  in  these  efforts.  In  the  work  presented  in  this  paper, 
a  systematic  hierarchical  approach  is  presented  to  solve 
the  multiagent  motion  planning  problem  in  the  presence 
of  motion  model  uncertainty,  stochastic  maps,  and  with 
non-trivial  agent  dynamics.  We  solve  a  multi-agent  Markov 
decision  process  (MMDP)  by  decoupling  the  problem  using 
a  hierarchical  planning  structure  and  multiple  MDPs.  In  this 


paper,  we  propose  to  solve  the  coordination  problem,  be¬ 
tween  agents,  by  developing  a  passive  coordination  strategy, 
using  the  Generalized  PRM  (GPRM)  [1,  ],  developed  by 

the  authors  for  the  single  agent  feedback  motion  planning 
problem,  in  conjunction  with  a  well  proven  multiple  traveling 
salesman  problem  solution  methodology  [1  ],  for  a  class  of 
multiagent  systems. 

In  section  II,  the  framework  developed  by  the  authors 
to  handle  single  agent  motion  planning  problem  under 
process  uncertainty,  with  non-linear  dynamics,  and  in  high 
dimensional  configuration  space,  is  briefly  presented  [2].  In 
section  III,  the  actual  multi-agent  motion  planning  problem 
we  intend  to  solve  in  this  paper,  is  proposed.  In  section  IV, 
the  solution  methodology  is  presented.  In  section  V,  the 
solution  of  the  “Routing  Problem”  is  detailed.  In  section  VI, 
results  from  simulations  are  presented. 

II.  Motion  planning  for  a  single  agent  under 

PROCESS  UNCERTAINTY 

The  motion  planning  problem,  for  a  single  agent,  is  to  find 
a  collision  free  path  for  a  robot  in  a  given  obstacle  space. 
In  the  presence  of  stochastic  model  uncertainty,  there  is  a 
need  for  feedback  control,  which  then  can  be  associated  with 
a  probability  that  the  robot  reaches  the  goal  without  hitting 
the  obstacles.  Generalized  Sampling  Based  Algorithms  [1,2] 
were  introduced,  by  the  authors,  to  address  the  problem  of 
feedback  motion  planning  in  such  constrained  work  spaces. 
The  notion  of  collision  avoidance  and  collision-free  paths  as 
the  solution  to  the  motion  planning  problem,  can  no  longer  be 
satisfied,  and  therefore  the  above  criteria  need  to  be  replaced 
by  a  solution/  path  with  a  high  probability  of  success.  The 
motion  planning  problem  is  then  re-framed  as  :  To  solve  the 
motion  planning  problem  in  the  presence  of  stochastic  maps , 
and  model  uncertainty \  generate  a  feedback  solution  with  a 
probability  of  success  above  an  a-priori  specified  probability , 
Pmin  • 

A.  Generalized  PRM  :  GPRM 

If  the  uncertainties  in  the  robot  model  and  environment 
can  be  modeled  probabilistically,  the  robot  motion  planning 
problem  can  be  formulated  as  Markov  Decision  Problem 
(MDP).  These  MDPs  are  computationally  intractable  for 
anything  but  small  state/  control  spaces,  and  especially  hard 
to  solve  in  continuous  state  and  control  spaces.  Hierarchical 
Methods  can  be  used  to  break  down  the  complexity  of  the 
problem.  The  Generalized  Probabilistic  Roadmaps  (GPRM) 
[2,  1],  is  a  sampling  based  hierarchical  method  which  extends 
the  Probabilistic  Roadmaps  (PRM)  [19]  technique  for  deter¬ 
ministic  path  planning,  to  systems  with  stochastic  model  and 
map  uncertainty.  GPRM  incorporates  feedback  controllers 
into  the  topological  graph  construction  phase. 

The  authors  have  also  developed  adaptive  sampling  tech¬ 
nique,  termed  as  “adaptive  GPRM”  (AGPRM),  [3,  20]  to 
increase  the  efficiency  and  overall  success  probability  of 
these  planners,  especially  in  high  dimensional  spaces.  The 
technique  have  been  implemented  on  high-dimensional  n- 
link  manipulators,  with  up  to  8  links.  The  results  demonstrate 


the  ability  of  the  proposed  algorithm  to  handle  the  feedback 
motion  planning  problem  for  highly  non-linear  systems,  in 
very  high-dimensional  state  spaces. 

In  this  paper,  this  work  is  extended  to  solving  stochastic 
motion  planning  problem  for  a  class  of  heterogeneous  multi¬ 
agent  systems. 

III.  Multi-agent  systems 

Multi-agent  system  (MAS)  consists  of  multiple  agents 
which  execute  actions  and  influence  their  surroundings.  Each 
agent  receives  observations  and  selects  actions  individually, 
but  it  is  the  resulting  joint  action  which  influences  the 
environment  and  generates  the  reward1  for  the  agents.  This 
has  extremely  important  consequences  on  the  characteristics 
and  the  complexity,  of  the  problem. 

A.  Coordination  Problem 

The  multi-agent  motion  planning  problem  in  presence  of 
process  uncertainty  and  stochastic  maps  can  be  posed  as  an 
multi-agent  Markov  decision  process  (MMDP).  In  traditional 
methods  of  solving  an  MMDP,  the  problem  is  treated  as  a 
single  large  MDP  and  standard  solution  techniques  available 
for  MDPs  are  applied,  [21].  The  goal  of  solving  an  MMDP 
should  be  that  of  finding  the  best/optimal  policy  for  the 
system  of  agents.  However,  actions  are  taken  at  the  individual 
level  of  the  agents,  and  thus,  it  should  be  ensured  that, 
using  limited  communication,  the  combined  actions  of  all 
the  agents  should  result  in  an  optimal  policy  for  the  system 
of  agents.  The  problem  of  identifying  individual  policies  for 
each  agent,  which  results  in  an  optimal  joint  policy,  is  called 
the  coordination  problem. 

Researchers  working  on  solving  this  problem  have  come 
up  with  possible  solutions  such  as  coordination  graphs  (CGs) 
[22]  and  max-plus  algorithm  in  conjunction  with  CGs  [2-  ]. 
In  [2  ],  the  solution  to  cooperative  action  selection  for  a 
system  of  agents  (or  coordination  problem)  is  solved  by  con¬ 
structing  a  coordination  graph ,  and  optimizing  over  it  using 
the  variable  elimination  algorithm.  In  [23],  the  researchers 
proposed  an  improved  optimization  technique,  the  max-plus 
algorithm,  which  replaces  the  variable  elimination  procedure 
in  optimizing  over  the  coordination  graphs. 

B.  A  Class  of  Multi-Agent  Problems  in  the  Presence  of 
Uncertainty 

We  want  to  solve  the  multi-agent  motion  planning  prob¬ 
lems,  in  the  following  scenario  : 

•  m  agents,  with  m  initial  configurations  g7  = 
{qh,qh,...,qIm}, 

•  n  goal  locations  qG  =  {qGl ,  qGl , . . . ,  qGn  }, 

•  Process  uncertainty2  present  in  robot  motion  model, 

•  Environment  given  by  a  stochastic  map,  i.e.,  static 
obstacle  probabilities. 

The  problem  statement  can  be  stated  as  :  Given  a  stochastic 
map  with  static  obstacle  probabilities,  a  system  of  m  hetero¬ 
geneous  robots  each  equipped  with  perfect  state  sensors,  the 

^ost  of  transition 

2  also  called  motion  model  uncertainty 


initial  configurations  (qj)  of  all  the  robots,  a  set  of  n  final  goal 
configurations  (qG),  to  solve,  the  motion  planning  problem  for 
the  set  of  robots  in  the  presence  of  process  uncertainty  such 
that  at  least  one  robot  visits  each  of  the  goal  locations,  while 
the  total  cost  of  operation  for  the  system  is  minimized 

In  order  to  solve  the  multi-agent  problem,  the  following 
sub-problems  have  to  be  solved  : 

•  Routing  problem  :  The  number  of  agents  and  number 
of  goal  locations  might  not  be  same,  i.e.  m^n  (general 
case).  The  goal  locations  are  different  in  number,  some 
agents  will  have  to  go  to  more  than  one  goal  and  some 
might  not  have  to  go  to  any  goal.  Hence,  with  the  given 
scenario,  one  has  to  solve  a  routing  problem  (or  the 
coordination  problem  as  discussed  in  subsection  III- A) 
for  the  multi-agent  system. 

This  is  the  problem  of  identifying  which  agents  will  go 
to  which  goal  locations.  Hence,  given  m  agents  and  their 
initial  configurations,  q:  =  {#/(/)},  i  =  and  n 

target  final  configurations,  qG  =  {qd}) },  i  = 
and  given  m/n  (general  case),  how  to  determine  which 
set  of  goals  any  given  agent  will  go  to. 

•  Heterogeneous  and  homogeneous  agents  :  In  a  general 
multi-agent  system,  there  exist  heterogeneous  agents, 
i.e.  agents  with  different  capabilities  (or  multiple  types 
of  agents).  Thus,  homogeneous,  as  well  as  the  heteroge¬ 
neous  agent  scenario  must  be  addressed  in  a  multi-agent 
motion  planning  problem. 

IV.  Solution  Approach  to  Multiagent  Motion 
Planning  Problem  in  Presence  of  Uncertainty 

This  section  discusses  the  solution  approach  to  solve  the 
sub-problems  of  the  multi-agent  motion  planning  problem. 

A.  Routing  Problem 

The  routing  problem,  in  a  deterministic  framework,  has 
been  solved  extensively  in  the  traveling  salesman  problem 
(TSP)  research  community  [2^  ].  The  generalized  multi-agent 
routing  problem,  in  deterministic  framework,  has  been  posed 
as  a  multiple  traveling  salesman  problem  (MTSP)  [25,  18], 
and  there  have  been  multiple  approaches  to  solve  it.  We 
aim  to  use  existing  multiple  agents  routing  problem  solution 
techniques  developed  in  [18],  and  synergistically,  apply  it 
along  with  the  GPRM  to  the  multi-agent  systems,  in  presence 
of  process  uncertainty,  and  stochastic  maps.  The  solution 
of  GPRM,  between  any  pair  of  goal  locations3  for  any 
given  agent,  generates  transition  costs,  and  probabilities.  The 
MTSP  algorithm,  uses  these  costs  and  transition  probabilities 
from  the  GPRM,  to  solve  the  “routing  problem”,  and  hence, 
solve  the  multi-agent  motion  planning  problem  under  uncer¬ 
tainty.  We  expect  that  this  generalized  technique,  termed  -  the 
multi-agent  adaptive  sampling  based  generalized  probabilis¬ 
tic  roadmaps  (MAGPRM),  will  help  us  solve  the  feedback 
motion  planning  problems  in  high  dimensional  state  spaces, 
under  uncertainty,  in  the  multiple  agent  scenario. 

3  Goal  locations  here  includes  the  initial  configurations  and  the  desired 
configurations 


B.  Homogeneous  and  Heterogeneous  Agents 

In  the  case  of  all  agents  being  homogeneous  in  dynamics 
and  capabilities,  a  single  roadmap  (GRPM)  is  adequate,  for 
solving  the  motion  planning  problem  for  an  agent  between 
any  pair  of  the  goal/start  locations.  For  heterogeneous  agents, 
solving  the  motion  planning  problem  will  involve  construct¬ 
ing  roadmaps  (GPRMs)  for  every  type  of  agent  present  in 
the  system,  (see  Figure  1) 


Fig.  1.  Multiple  Agents  and  the  number  of  associated  GPRMs 


V.  The  Solution 

In  this  section  we  develop  the  solution  to  multi-agent 
motion  planning  problem  by  developing  it  for  the  sub¬ 
problems  of  routing  the  agents  and  having  heterogeneous 
agents  in  a  multi-agent  scenario.  We  develop  a  synergistic 
coupling  of  MTSP  solutions  with  GPRM  to  solve  the  routing 
problem. 

A.  Definitions 
1)  General: 

:  The  configuration  space  of  the  agent.  A  configu¬ 
ration  is  given  by  q,  i.e  a  generalized  position. 

:  The  state-space  of  the  agent.  A  state  is  given  by, 
v  =  (q,q),  i.e.  comprised  of  generalized  position 
and  generalized  velocity. 

V  :  ith  landmark,  i.e.  a  sample  in  state-space  ( ll  G  3tf). 
:  Set  of  landmarks,  i.e.  ££  =  {/*},Vz,  on  a  given 
stochastic  map. 

Sf  :  Set  of  start  and  goal  locations4  in  a  multiple 
agents  scenario  on  a  given  stochastic  map.  This 
set  containing  these  start  and  goal  locations,  is  a 
sub- set  of  the  set  of  landmarks,  i.e.  C  ££ 
srf  :  Set  of  all  agents,  {a/},  V/.  (a;  is  the  ith  agent) 

:  Set  of  controls.  (wGf) 

:  Set  of  lower  level  controllers,  (p  G 

4  landmarks 


2)  Controls: 

fi(-)  :  The  lower  level  (Lev el \  in  MAGPRM  in  Figure  2) 

controller  for  the  agents.  In  MAGPRM  it  is  a  feed¬ 
back  controller,  parametrized  using  the  landmark. 
Also  jUG  /  and  : 

/i(-)  :  3C  i  y  <f/ 

n(-)  :  Policy  operator  at  Lev  eh  of  MAGPRM  (Figure  2), 

i.e.  solution  of  GPRM  for  a  single  agent. 

7t(-) 

Given  a  goal  landmark,  lgoal ,  n  is  a  solution  pro¬ 
vided  by  GPRM.  This  solution  is  dependent  on  lgoal 
and  hence  the  operator  n  can  be  rigorously  written 
as  follows: 

7 r(-  ;  /*“') 

y(-)  :  An  operator  at  Levels  of  MAGPRM. 

y(-)  :  ^  x  Sf  i-a  Gf 

Given  a  particular  agent  <2;  G  <£/,  and  the  location(G 
Sf)  of  say  gG^,  the  operator  outputs  the  next 
goal  location  for  i.e.  g'  G  .  This  g'  parametrizes 
the  Lev  eh  7t(-)  operator,  i.e.: 

7 r(-  ;  g')  :  jSf  ha  Jf 

Furthermore  in  terms  of  goal  landmark,  lf)al  G  ££ 
for  the  Ith  agent,  the  location  g'  G  Sf  where  Sf  C  JSf , 
is  given  by  : 

g/  _  igoal  ^ 

;r(-  ;  /foa/)  ;  j£f  i-a  .y#,  for  ith  agent 

This  operator  provides  the  agent,  starting  at  a  start 
location  (G  Sf),  the  goal  location  (G  Sf)  it  is  sup¬ 
posed  to  go,  and  a  sequence  of  feedback  controllers 
(G  to  reach  there. 

B.  Solution  of  MTS P 

In  [18],  a  solution  methodology  is  proposed  for  the  gen¬ 
eralized  MTSP  problem  formulation.  The  solution  method¬ 
ology  involves  two  transformation  steps: 

•  Converting  a  generalized  MTSP  to  a  one-in-a-set  ATSP 
(where  ATSP  :  Asymmetric  Traveling  Salesman  Prob¬ 
lem). 

•  Converting  a  one-in-a-set  ATSP  to  a  single  ATSP.  This 
is  done  using  the  Noon-Bean  Transformation  [26]. 

The  generalized  MTSP  is  posed  as  a  single  ATSP  by  the 
proposed  transformations  which  involve  cost  modifications. 
The  single  ATSP  can  be  solved  using  the  well-known  TSP 
solver,  Lin-Kernighan  heuristic  (LKH)  [24].  Solving  the 
single  ATSP  and  working  backwards  gives  the  solution  to 
the  generalized  MTSP.  Details  of  the  algorithm  developed 
can  be  seen  in  [18]. 


C.  Solving  Multi-Agent  Systems  in  Presence  of  Uncertainty 

The  GPRM  was  developed  [2]  as  a  hierarchical  approach. 
The  proposed  algorithm,  MAGPRM,  for  solving  the  multi¬ 
agent  motion  planning  involves  the  introduction  of  a  new 
level  in  the  existing  hierarchy.  The  lowest  level  (say  Level\ 
in  Figure  2)  solves  the  motion  planning  problem  between  one 
landmark  to  another  and  inherently  generates  the  cost  of  tran¬ 
sition  and  transition  probabilities  between  two  landmarks. 
These  transition  probabilities  and  costs  induced  an  abstract 
MDP,  on  the  discrete  set  of  landmark  states,  i.e.  the  higher 
level  (Level 2  in  Figure  2).  We  use  Dynamic  Programming  to 
solve  the  Leveh  abstract  MDP  with  the  transition  cost  and 
probabilities  generated  by  the  Level 


Fig.  2.  Depicting  Hierarchical  Planning  in  Levels  (for  Multi  Agents) 

In  a  multi-agent  motion  planning  scenario,  having  m 
agents  and  n  goal  locations,  an  additional  problem  needs 
to  be  solved,  namely  the  routing  problem.  In  order  to  solve 
the  routing  problem  we  introduce  Leveh  which  comprises  of 
a  graph  whose  vertices  are  high  level  goal  locations  (g  G  ), 
i.e.  the  m  agents’  initial  locations  and  the  n  desired  goal 
locations.  These  set  of  goal  locations  (Sf)  is  a  sub-set  of  the 
set  of  landmarks  (i.e.,  Sf  C  Jjf.  See  Figure  2).  At  Leveh , 
these  goal  locations  (Sf)  are  treated  as  landmarks.  Using  Jjf 
on  Leveh ,  multiple  GPRMs  (i.e.,  generalized  roadmaps)  are 
constructed  over  which  the  ‘single-agent’  motion  planning 
problem  is  solved.  Whereas,  using  £f,  a  MAGPRM  (i.e.,  a 
multi-agent  roadmap)  is  constructed  on  Leveh  over  which  the 
‘multi-agent’  motion  planning  problem  is  solved.  The  edges 
of  the  graph  in  Leveh  are  abstract  connections  from  “agent 
locations  to  goal  locations”  and  “goal  to  goal  locations”. 
“Agent  locations  to  agent  locations”  connections  are  avoided 
as  a  part  of  assumption  that  an  agent  should  not  go  to  another 
agent’s  location. 

Using  GPRMs,  the  cost  of  transition  and  the  path  probabil¬ 
ity  associated  with  all  these  edges  in  Leveh ,  can  be  computed 
(Figure  2).  These  costs  and  transition  probabilities  associated 
with  every  edge  is  specific  to  agents.  The  number  of  GPRMs 
that  needs  to  be  solved  are  2 n(m-\-n)5.  Using  the  computed 
costs6  and  after  two  cost  transformations,  the  prospective 

5 These  are  the  number  of  edges  that  needs  to  be  evaluated,  m  agents  to 
n  goals  and  vice  versa  gives  2 mn  edges,  n  goals  to  n  goals  gives  2 n2  edges 
and  hence  the  total  is  2 n(m  +  n) 

6The  MTSP  algorithm  only  takes  costs  as  input,  the  costs  computed  by 
GPRM  do  take  into  account  the  transition  probabilities. 


MTSP  algorithm  [18]  via  the  LKH  solver  solves  the  routing 
problem ,  i.e.  allotment  of  sequence  of  goal  locations  to 
different  agents. 

In  this  MTSP  level,  i.e.  Levels  of  MAGPRM,  the  solution 
is  the  operator  y(-).  For  each  agent,  the  operator  outputs  a 
sequence  of  goal  locations  to  be  visited.  We  mention  here 
that  unlike  in  Level\  and  Lev  eh,  the  solution  (y(-))  is  not  a 
feedback  policy  at  Levels,  since  if  the  sequence  is  broken  by 
the  agents  due  to  some  plausible  reason,  the  policy  does  not 
remain  optimal,  and  the  paths  of  the  different  agents  need  to 
be  replanned  using  the  MTSP  solver. 

D.  Multi-Agent  GPRM  ( MAGPRM )  Algorithm 

The  proposed  methodology  of  solving  the  multi-agent 
motion  planning  problem  is  summarized  in  algorithm  1. 


Algorithm  1:  Multi- Agent  GPRM  (MAGPRM) 

Data:  Set  of  agents  (srf),  start  locations  xo ,  goal 
locations  xg,  pmin  for  the  environment 

1  for  ith  agent  at  start  location  xot  G  vo  do 

2  for  jth  goal  location,  xgj  G  xg  do 

3  while  ps(xoi  xgj)  <  pmin  do 

4  if  ith  agent's  type  already  evaluated  then 

5  Use  already  existing  roadmap  to  build 
|_  and  connect  further; 

6  Construct  AGPRM,  parametrized  with  goal 
location  xgj  and  agent-type  of  ith  agent; 

7  Construct  a  cost  of  transitions  matrix  for  each 
agent-type  (i.e.  cost  of  transitions  between 
agents  GG  goals  and  goals  GG  goals); 

8  Solve  the  routing  problem  for  each  agent,  using  the 
above  generated  costs  in  prospective  MTSP  algorithm; 

The  proposed  algorithm  solves  the  general7  multiple  agent 
motion  planning  problem  in  presence  of  process  uncertainty 
and  stochastic  maps. 

The  algorithm  constructs  a  roadmap  using  AGPRM  be¬ 
tween  each  pair  of  start  and  goal  locations.  The  loop  at 
line  3  depicts  this,  i.e.,  with  each  combination  of  start  and 
goal  locations,  a  new  AGPRM  (which  is  parametrized  at  the 
current  goal  location),  is  solved. 

In  the  case  of  multiple  agents  of  same  type,  lines  4-5 
ensure  that  the  existing  roadmap,  is  either  extended  further, 
or  is  used  to  find  a  solution,  between  the  ith  agent’s  location 
and  the  7th  goal  location. 

In  the  case  of  heterogeneous  agents,  different  roadmaps  for 
different  types  of  agents  have  to  be  constructed  and  hence,  is 
more  computationally  intensive.  The  landmarks  might  still  be 
shared  but  transition  costs  and  transition  probabilities  calcu¬ 
lations  will  involve  running  the  simulations  and  constructing 
a  different  roadmap  each  time  a  new  type  of  agent  comes  into 
the  system.  Line  7  emphasizes  this  feature  of  the  algorithm. 

Once  the  cost  of  transitions  between  each  start  and  goal 
locations  is  computed,  the  routing  problem  is  solved  using 

7  Involving  heterogeneous  agents 


the  prospective  MTSP  algorithm.  The  solution  of  MAGPRM 
is  the  sequence  of  goal  locations  to  be  visited  by  individual 
agents,  which  takes  into  account  the  process  uncertainty 
in  the  dynamics  of  the  agents,  and  their  traversal  along  a 
stochastic  map. 

VI.  Results  and  Discussion 

In  this  section,  we  will  detail  the  application  of  the 
multi-agent  GPRM  (i.e.  MAGPRM)  algorithm  to  scenarios 
with  homogeneous  and  heterogeneous  agents.  The  agents 
involved  in  these  numerical  experiments  are  a  unicycle  and  a 
simplified  three  dimensional  vehicle,  mimicking  a  helicopter. 

A.  Vehicle  Models  Used 

The  numerical  experiments  done  using  MAGPRM  in¬ 
volves  the  following  two  types  of  robot  models  used  along 
with  their  specific  feedback  controllers. 

1 )  Nonholonomic  Unicyle  robot:  The  equations  of  motion 
are  given  by  : 

A  =  vcos  0  +  wx  (1) 

y  =  v  sin  6  +  wy  (2) 

6  =  co  +  wq  (3) 

where  (x,y,  0)  represents  the  pose  of  the  robot,  the  velocity 
v,  and  the  angular  velocity  CO ,  represents  the  control  inputs 
to  the  problem,  and  wx,wy  and  wq  are  the  uncorrelated  noise 
terms  for  the  different  states  of  the  robot.  A  sampled  pose  is 
in  the  (x,y,Q)  spaces  and  the  local  feedback  controller  used 
to  stabilize  the  robot  about  these  sampled  equilibrium  config¬ 
urations  is  given  by  [27],  a  dynamic  feedback  linearization- 
based  controller. 

2)  Simplified  3D  helicopter  robot:  A  simplified  three- 
dimensional  helicopter  robot  is  constructed  using  a  Dubins 
car  for  inplane  motion  (as  in  [1])  and  a  decoupled  double 
integrator  in  the  z— direction.  Hence  the  dynamics  of  this 
simplified  robot  can  be  given  by: 


x  =  vcos  6  +  wx 

(4) 

y  =  v  sin  0  +  wy 

(5) 

0  =  (Q  +  Wq 

(6) 

Z  =  uz  +  Wz 

(7) 

Apart,  from  the  definitions  discussed  above,  uz  is  the  input 
force  in  z— direction,  and,  wz  is  uncorrelated  noise  in  the 
same  direction.  Our  sampled  poses  are  in  the  (x,y,  0,z,z) 
spaces.  The  local  feedback  controllers  can  stabilize  the  robot 
about  any  of  the  equilibrium  configurations  sampled. 

A  dynamic  feedback  linearization-based  controller  design 
is  chosen  as  in  [1]  for  the  Dubins  car  model,  for  in-plane 
motion.  An  LQR  based  feedback  controller  is  designed  for 
the  double  integrator  in  z— direction  as  in  [2]. 

The  values  of  wx,wy,wg  and  wz  are  similar  to  those 
dicussed  in  [1,  2]. 

B.  Homogeneous  Agents 

In  these  numerical  experiments,  multiple  homogeneous 
agents8  starting  at  different  locations  on  a  stochastic  map 

8  All  agents  are  Dubins  car 


were  supposed  to  cover  a  given  number  of  goal  locations. 
As  the  agents  are  homogeneous,  the  cost  of  transition  and 
the  probability  of  transition  from  one  landmark  to  another 
is  the  same,  given  that  all  agents  are  working  with  the  same 
map  and  the  same  sampled  landmarks. 

The  result  of  our  simulation  experiments  are  shown  in  Fig¬ 
ure  3  and  Figure  4.  Figure  3(a),  Figure  4(a)  and  Figure  4(b) 
show  three  different  cases,  i.e.  different  number  of  agents 
starting  at  different  start  locations  and  that  have  to  visit  a 
different  number  of  goal  locations. 

Figure  3  depicts  a  case  in  which  all  the  start  locations  for 
the  robots  were  constricted  to  a  smaller  region  compared  to 
the  spread  of  the  goal  locations.  Figure  3(a)  shows  the  solu¬ 
tion  of  MAGPRM  (i.e.  at  Level 2  of  MAGPRM)  in  terms  of 
the  goal  locations  to  be  visited  by  the  active9  agents  and  the 
various  landmarks  used  to  navigate  through  those  assigned 
goal  locations.  Figure  3(b)  shows  the  actual  trajectories  (i.e. 
at  Lev  el  1  of  MAGPRM)  of  the  active  agents  based  on  the 
dynamics  and  the  corresponding  feedback  controller. 

The  solution  shows  that  MTSP  (i.e.  at  Levels  of  MAG¬ 
PRM)  is  driven  by  space  partitioning.  And  hence,  there  are 
some  agents  who  do  not  move  from  their  initial  locations. 


(b)  MAGPRM  -  Trajectories  for  individual  vehicles 

Fig.  3.  MAGPRM  Solutions  and  Trajectories  -  10  vehicles  and  30  goal 
locations 

Figure  4(a)  and  Figure  4(b)  show  results  obtained  by 
MAGPRM,  depicting  coverage  of  goal  location  by  the  agents 

9 In  MAGPRM  solution  not  all  agents  need  to  move  and  hence  active 
agents  are  the  ones  which  have  been  assigned  atleast  one  goal  location. 
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(a)  MAGPRM  -  5  vehicles  and  40  final  locations 
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(b)  MAGPRM  -  10  vehicles  and  40  final  locations 
Fig.  4.  MAGPRM  Solutions 


starting  from  different  start  locations.  Both  the  solutions  have 
40  goal  locations  and  the  number  of  agents  are  5  and  10 
respectively.  The  solution  of  MAGPRM  in  these  solutions 
also  show  the  space  partitioning  behavior. 

C.  Heterogeneous  Agents 

In  these  set  of  experiments,  heterogeneous10  agents  are 
present  in  the  map.  A  three  dimensional  static  stochastic  map 
is  used  for  these  simulations.  In  each  of  these  simulations 
there  are  several  3 -dimensional  and  2-dimensional  goal  lo¬ 
cations.  The  Dubins  car  can  only  cover  the  2-dimensional 
goal  locations,  and  the  simplified  3D  helicopter  robot  can 
traverse  to  both  2-dimensional,  as  well  as,  3 -dimensional 
goal  locations. 

The  initial  set  of  landmarks  sampled  were  2-dimensional 
goals.  The  connections  of  3 -dimensional  goal  locations  was 
facilitated  using  AGPRM  [3],  hence  the  solutions  shown 
below  has  less  number  of  3 -dimensional  sampled  landmarks. 

Figure  5  and  Figure  6  shows  the  MAGPRM  solution, 
for  different  sets  of  start  and  goal  locations.  The  Dubins 
car  covers  all  the  2-dimensional  goal  locations,  while  the 
simplified  3D  helicopter  robot  covers  only  the  3 -dimensional 
goal  locations.  Figure  5(a)  shows  the  solution  of  MAGPRM 

10A  Dubins  car  and  a  simplified  3D  robot 


and  Figure  5(b)  shows  the  actual  trajectories  of  the  robots. 
A  partitioning  of  space  is  again  visible. 


Fig.  6.  MAGPRM  with  Dubins’  Car  and  3D  Vehicle  with  5 -Obstacles  : 
Case  2 


(a)  MTSP  solution  for  the  robots 


(b)  Trajectories  of  the  robots 

Fig.  5.  MAGPRM  with  Dubins’  Car  and  3D  Vehicle  with  5 -Obstacles  : 
Case  1 

The  heterogeneous  agents  case  discussed  in  this  section 
depicts  the  power  of  MAGPRM.  The  stochastic  decision 
making  problem  in  presence  of  heterogeneous  agents,  for 
which  the  computation  of  the  cost  of  transitions  are  different, 
for  each  agent  type,  is  solved.  In  order  to  solve  the  het¬ 
erogeneous  agents  problem,  the  MAGPRM  utilizes  multiple 
GPRMs  constructed  on  the  underlying  state- space  of  each 
type  of  agent. 

VII.  Conclusion 

In  this  paper,  we  have  presented  a  solution  to  the  motion 
planning  problem  under  uncertainty  for  multiple  agents.  In 
order  to  solve  the  overall  problem,  in  conjunction  with  our 
solution  methodology  for  a  single  agent  (GPRM),  a  routing 
problem  needs  to  be  solved.  The  routing  problem  is  solved 
using  an  existing  solution  to  the  multiple  traveling  salesman 
problem.  The  MTSP  solution  methodology,  in  conjunction 
with  GPRM,  results  in  the  MAGPRM  algorithm  that  solves 
the  motion  planning  problem  for  multiple  agents  in  pres¬ 
ence  of  process  uncertainty,  and  stochastic  maps.  Numerical 
experiments  were  performed  on  sets  of  homogeneous ,  and 
heterogeneous  agents ,  for  maps  of  different  difficulty  levels, 


with  different  number  of  start  and  goal  locations,  and  dif¬ 
ferent  number  of  agents.  Results  show  that  the  algorithm  is 
indeed  capable  of  solving  the  motion  planning  problem  for 
multiple  agents,  with  non-trivial  nonlinear  dynamics,  in  the 
presence  of  process  uncertainty  and  stochastic  maps. 
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